Chapter 5. Working on a Unix System

Unix has a wealth of functions, and you'll want to be aware of a particular subset of them before you start running programs and collecting data. In Chapter 4, we talked about how to organize and manage your files in Unix, as well as how to move around the filesystem. In this chapter we take you on a whirlwind tour through the common Unix commands you'll need to know to work efficiently. We discuss the Unix shell itself, issuing commands in Unix, viewing, editing, and extracting information from your files, shell scripts, and working in a multiuser environment.

Once you've learned to use some of these Unix commands, you'll find that they are astonishingly powerful and flexible, allowing you to modify files in ways that are impossible, or at least not easy, with a conventional word-processing program. For example, with a single command you can find all the instances of a pattern in every file under your home directory. A few simple tricks can create a script that will process every file in your source data directory identically. Another simple script can update a customized local copy of a database every night while you're sleeping.

The Unix Shell

When you log into a Unix system or open a new window in your system's window manager interface, the system automatically starts a program called a shell for you. The shell program interprets the commands you enter and provides you with a working environment and an interface to the operating system. It's possible to work in Unix without the shell using graphical file manager tools, but you'll find that many shell commands are useful for data processing and analysis. Entire books devoted to the various shells are available, and the manpages for some of the common shells exceed 100 pages when printed. We provide you only with a brief introduction to the commonly used shells, to get you started with as few hurdles as possible.

What Flavors of Shell Are There?

The shell program you use affects the feel of your command-line interface. Some of the features that can be built into the shell program include a simple arithmetic interpreter that lets you use the command line as a calculator; command aliasing, which lets you refer to standard Unix commands with other more convenient words; filename completion, which lets you type only the number of characters necessary to distinguish a file from other files in the directory, rather than typing the full filename; command editing and command history, which let you scroll back through the commands you've recently issued and edit them on the command line; spelling correction; and help functions for the shell program.

There are a number of common shell programs on Unix systems. You are automatically assigned a shell when your system administrator sets up your account. On Linux systems, the default shell program is the bash (Bourne Again) shell. However, you may prefer to use a shell other than bash. The two main classes of shell programs are shells derived from the Bourne shell, sh, and shells derived from the C shell, csh. Bourne-type shells include sh, bash, ksh (the Korn shell), and zsh (the Z shell). C-type shells include csh and tcsh.

We tend to prefer C shells, for historical reasons. When we started working in Unix, the C shell was the best thing going, and the tcsh program has expanded the original csh into a powerful shell. tcsh implements most of the desirable shell features, including history, command aliasing, filename completion, command-line editing, arithmetic and functions, job control, and spelling correction. tcsh is also one of the most user-configurable shells. Therefore, we'll discuss the behavior of Unix commands from a C-shell perspective, as if you were using the tcsh program, which we use on our machines.

Your default shell will be listed as the last item in your entry in the /etc/passwd file. If you aren't certain which shell you are currently using, you can find out by typing:

% finger your-user-name

For user jambeck, this command shows the following information:

Login name: jambeck           In real life: Per Jambeck 
Directory: /home/jambeck      Shell: /bin/tcsh

This tells us that he is using tcsh as his default shell. For practical reasons, we will limit our discussions and most references to csh and tcsh. It must also be noted that many system processes (e.g., batch, at, and cron) use the Bourne shell by default, which makes it necessary to learn at least a minimal subset of its command language. On most systems there are commands to change your default shell as set in the passwd file. The chsh (change shell) command allows you to change your default login shell, if you're working on a Linux system.

Issuing Commands on a Unix System

There's a standard format for sending an instruction to Unix. In this book, we'll refer to commands and to the command line. Each of Unix's many native commands has a tangible existence as an executable program, and to issue the command is to tell Unix to execute that program. In this section and those that follow, we move fairly quickly through concepts and commands. While we can give you a brief overview of the Unix features we find most useful, this book isn't designed to replace a comprehensive Unix reference book. If you're new to Unix, we strongly recommend that you review the basics of Unix with the help of books such as Learning the Unix Operating System, Running Linux, or Unix for the Impatient. We've provided a list of recommended reading in the Bibliography.

The Command-Line Format

The command line consists of the command itself, optional arguments that modify how the command works, and operands such as files upon which the command operates. For example, the chsh (change shell) command, which we just discussed briefly, has several possible options. The first is the -s option, which must be followed by the name of a shell program as its argument. The second is the -l option, which needs no argument, and which lists the shells that are available on your system. The operand for the chsh command is the username of the user whose shell is being changed. So, to change your default shell program, you might first type:

% chsh -l

which gives you a list of the shell programs available on the system:

/bin/bash 
/bin/sh 
/bin/ash 
/bin/bsh 
/bin/bash2 
/bin/tcsh 
/bin/csh 
/bin/ksh 
/bin/zsh 

Then, to actually change your shell to tcsh, you can type:

% chsh -s /bin/tcsh yourusername

Options can be simple single-letter codes, or they can take arguments of their own. Options that take no arguments can be given as a group, while each option that takes an argument must be specified separately. Each option group and each separate option must be preceded by a hyphen (-). The last option in a group, or a separate option, can be followed by its argument. The operands follow the final option in the list.

Many Unix commands have options that, frankly, you'll never use. And we're not going to talk about them. But there are ways of finding out more.

Unix Information Commands

Unix has its own built-in reference manual, which is quite comprehensive and informative, and which will give you the correct information about the commands and options available on the particular system you're using.

The man command is one of the most useful Unix commands; it allows you to view Unix manual pages. While some Unix systems have implemented a web browser-like interface to the Unix manpages, you can't always count on this option being available. The man command is available on all types of Unix systems.

Usage: man name

where name can be a Unix command, such as grep, or a system file, such as the password file /etc/passwd.

If you're not sure of the command you're looking for, you can sometimes find the right information using man's slightly smarter cousin, apropos. The apropos command locates commands by keyword lookup.

Usage: apropos name

For instance, if you're concerned about disk usage on your system, you can enter apropos usage. The output of this command on our PC running Red Hat Linux is:

du (1)	- summarize disk usage 
getrlimit, getrusage, setrlimit (2)	- get/set resource limits and usage
quota (1)	- display disk usage and limits
quotacheck (8)	- scan a file system for disk usages

apropos doesn't always produce such brief and informative output. Entering a smart combination of keywords is (as always with such searches) the key to getting the output you want. If you want a predictable listing of Unix commands, it's probably best to pick up a comprehensive Unix book.

What should you do if you find the following text in a manpage?

This documentation is no longer being maintained and may be inaccurate or 
incomplete. The Texinfo documentation is now the authoritative source. 

The GNU[*] set of Unix tools is adopting a documentation system, called texinfo, that is different from the traditional man system. If you come across this message, you should be able to read the up-to-date documentation on the program by typing the command info progname. For instance, info info gives you a complete set of documentation on the use of info and even provides instructions for creating your own info documentation when you start writing your own programs.

Standard Input and Output

By default, many Unix commands read from standard input and send their output to standard output. Standard input and output are file descriptors associated with your terminal. A program reading from standard input will simply hang out and wait for you to type something on your keyboard and press the Enter key. A program writing to standard output spews its output to your terminal, sometimes far faster than you can read it.

Some Unix commands read a hyphen (-) surrounded by whitespace on either side as "data from standard input." This construct can then be used in place of a filename in the command line. Absence of an output filename is sufficient to cause the program to write to standard output.
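For example, cat (covered later in this chapter) accepts the hyphen anywhere in its file list. In this sketch, with a hypothetical filename, notes.txt is read from disk first, and then whatever you type at the keyboard is read from standard input:

% cat notes.txt -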

Redirection of Command Input and Output

The standard input and output descriptors are useful because you can redirect both standard input and output, associating them with filenames, with no effects on the functioning of the program. Here are the most common redirection constructs used by the C shell:

<

This redirector preceding a filename associates that filename with standard input, i.e., the contents of the file are presented to the program as if they are standard input.

>

This redirector associates a filename with standard output, so that the filename is created on execution of the command, or whatever is in an existing file of that name is overwritten by the output of the command.

>>

This redirector associates a filename with standard output. It differs from > in that the output of the command is appended to the end of the existing file.

The cat command reads the contents of a file and writes them to standard output. If you want to use the cat command to combine the contents of three files into one new file, you can use a redirector like this:

% cat file1 file2 file3 > file4 

This construct with cat would be useful if, for example, you'd just downloaded a bunch of individual sequence files from the NCBI web site and want to collect them into one large file that can be read by another program. (This is an example of something that seems like it should be simple, but is actually time-consuming and annoying to do with a standard PC word-processing program. Unix provides a neat solution that doesn't even require you to open any files).

You can also use redirectors to direct the contents of a file into a program at run-time, as standard input (useful if you are running a program that prompts you for input from the keyboard) or to capture output from a program that is normally written to standard output:

program < inputfile
program > outputfile

For example, let's say you've just finished an extensive BLAST search, and you want to send the results to your colleague. You can use the redirector < ("less than") to scoop the file huge_blast_report out of your directory and mail it directly to your colleague:

% mail [email protected] < huge_blast_report

If you want to increase the chances of your colleague opening the message, you can add a subject header to the mail message using the mail option -s. The command reads:

% mail -s "surprise!" [email protected] < huge_blast_report

The reverse operation, sending the results of standard output (or text that's displayed on your screen) to a file, can be accomplished using > ("greater than"). Perhaps your colleague wants to write a quick reminder to herself to reply to your mail. She could do it using the cat command to take input from the keyboard and redirect it to a file, like this:

% cat > reminder_to_self
Ha! Send fifteen BLAST reports to colleague on Monday.
^D
%

Ctrl-D (^D) signals that you have finished entering text. Your colleague now has a file called reminder_to_self in her current working directory.

Operators

Operators are similar to redirectors in that they are ways of directing standard input and output. However, they direct input and output to and from other commands rather than to filenames.

The most commonly used operator is the pipe (|). The pipe directs standard output of one command into standard input for the next command. This allows you to chain together several different filtering commands or programs without creating input or output files each time.

You can use the cat command to direct the contents of a file into a program that reads information from standard input:

% cat inputfile | program

This command construct does the same thing as the example we showed earlier (program < inputfile): both present the contents of inputfile to program as standard input. If you want to do a lot of runs of the same program using slightly different input, you can create multiple input files and then write a script that runs cat on each of those input files in turn and pipes their contents to program.
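Typed interactively in tcsh, which prompts for the body of the loop, a minimal sketch of that idea looks something like this (the input filenames and the name program are hypothetical; shell scripts proper are covered later in this chapter):

% foreach f (input*.txt)
foreach? cat $f | program > $f.out
foreach? end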

Pipes can carry out a complete set of file-processing operations without writing to disk. For instance, imagine that you have a datafile consisting of multiple tables concatenated together. The first table in the file takes up the first 67 lines, the second table takes up the next 100 lines, and the rest of the file is taken up by a third table.[†] You want the information that's contained in the second column of the middle table, which stretches from characters 30-39 in each row. Using filters and pipes, you can construct the following command to crop out the data you need:

% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data 

In this example, head sends the top 167 lines of a specified file or files (in this case protein1.pka) to standard output; tail takes the last 100 lines of the output of head; and cut takes the correct column of characters out of the results of head and tail and then stores it in protein1.pka.data.

Wildcard Characters

A useful construct Unix shells recognize is the presence of wildcard characters in filenames. The shell locates matches for any wildcards before passing filenames on to the program. The two most commonly used wildcards are the asterisk (*) and the question mark (?). * means "any sequence of zero or more characters, except for the / character." ? means "any single character." Thus, "every file in this directory" can be denoted by a lone *, which is a useful shortcut.

The shell recognizes other wildcards as well. The construct [cset] refers to any character in the specified set. If you want to move all files beginning with letters a through m to a new directory, you can structure the command as mv [a-m]* ../newdir. If you want to move all files beginning with a number to a new directory, enter mv [0-9]* ../newdir.

Running X Commands

On Unix systems running the X Window System, there are many commands available that initiate programs with functions that aren't command line-based. Once these programs, which can include anything from graphics viewers to complicated scientific applications, are called from the command line, they use the X Window System to open their own windows, which generally contain a complete, independent graphical user interface.

Viewing and Editing Files

You're probably accustomed to the idea of using a program to open a file. If your first introduction to computers has been sometime in the last 15 years, you're probably used to simply clicking on a file icon, which is automatically recognized by the right piece of software, which opens the file.

In Unix, commands are designed to operate on files that are sensibly readable and printable as text whenever possible. Thus text files can be opened by a wide variety of commands that allow a great deal of flexibility in file manipulation. The file reading and processing commands have such functions as sorting data based on the value of a particular substring in each line of the file, cutting a particular column out of a file, pasting columns of data together side by side, checking to see what the differences between two files are, and searching for instances of a pattern in a file or group of files. Often, these simple commands are all you need to extract a desired subset of the data in a file and prepare it for analysis.

Unix has many ways to view and edit the contents of files. There are viewers for text and programs that allow you to examine the contents of binary files, as well as full-featured editors for modifying plain-text files.

Viewing and Combining Files with cat

Usage: cat -[options] files

cat dumps the contents of a file onto the screen. If your file is short, or if you've successfully completed a speed-reading course, this utility works well. If you need to see what's on each page of a file, though, cat is less useful, since the contents of the file scroll by without pausing.

Instead of viewing text, cat is most useful for combining (or concatenating) files. For instance, if you have a series of files of program output named meercat1.txt, meercat2.txt, and meercat3.txt, and you want to combine them into a single file, you can type:

% cat meercat1.txt meercat2.txt meercat3.txt > big-meercat.txt 

This command appends the contents of meercat2.txt to the end of meercat1.txt and the contents of meercat3.txt to the end of meercat2.txt, combining them into one big file named big-meercat.txt. If you've thought to number the outputs sequentially (as we have with the meercats), and want them in that order in the file, you can just type:

% cat meercat*.txt > big-meercat.txt

and it will have the same effect. Wildcards expand filenames in strict alphabetical order, though: if they exist, files meercat10.txt and meercat11.txt come before meercat2.txt.

cat can also append files to the end of an existing file. For example, if your program generates another output file you need to attach to the end of the collection, the command:

% cat meercat10.txt >> big-meercat.txt

does just that. If you use > instead of >> in this situation, instead of being added at the end of the file, the new file meercat10.txt overwrites the entire contents of big-meercat.txt.

Incidentally, if you want a command that's the reverse of cat to print the lines of a file in backward order, you're in luck: the command is called tac. Sadly, the command acta, for printing a file inside out, hasn't yet been implemented.

more: A Step in the Right Direction

Usage: more -[options] [+linenumber] [+/pattern] filename

more is a pager, which in Unix means a program that lets you view a file one page at a time. Suppose you have a file containing BLAST output named blast-first.txt. Typing:

% more blast-first.txt 

shows you the first page of the file blast-first.txt, and steps forward one page every time you press the space bar. To leave more, hit the q key; to view other more commands while within more, enter h.

more is smart about moving around files. If you know where you want to go in the file, you can specify the line number (using the + linenumber option). If, on the other hand, you want to start at the first occurrence of a certain word or pattern, use the +/ pattern option. When viewing a file in more, if you press the / key and then type a pattern to search for, more jumps to the next occurrence of that pattern in the file and repeats searches for each subsequent occurrence of that pattern every time you press / followed by the Enter key.
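For example, assuming the BLAST output file from above, the first command below opens the file at line 100, and the second opens it at the first occurrence of the word Score (a string that appears in typical BLAST reports):

% more +100 blast-first.txt
% more +/Score blast-first.txt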

Here are some other useful options for more:

-r

Shows normally unprintable control characters as well as normal text

-s

Squeezes multiple empty lines into a single one

You can redirect the output of a program that generates more than a screen's worth of text to more, allowing you to page through the output one screen at a time. Let's say you want to know who is logged into your Unix system. If enough users are logged in, the output scrolls off the screen. By piping who to more:

% who | more

you can scroll through the output line-by-line using the Return key or screen-by-screen using the space bar.

more's most significant shortcoming is that some versions can't move backward through a file. less is a utility that remedies this simple problem.

less: The Gold Standard

There is a superior pager command, less. Most importantly, less rectifies more's biggest flaw: it lets you page backward as well as forward in a file. less also doesn't load a file into memory all at once, which makes it less likely that your computer will grind to a halt if you view a huge file with it. Finally, it also handles binary files more gracefully, displaying readable text as characters and representing unreadable control characters in the form ^X. less uses the same options as more, but it also takes additional options. Be sure to check info less to see which ones your local version takes. And finally, while it hardly bears mentioning, why is it called less? Because less is more. Sigh.

Editing Files with vi and vim

Usage: vim filename

Because it's a text-based operating system that has historically been used for software development and computation, Unix did not traditionally provide the kind of full-featured, "what you see is what you get" text editing that exists on personal computers, although now such editors are available. In fact, WYSIWYG text editors are of limited utility for programmers because they often introduce invisible markup characters into documents.

It's worth learning to use the plain-text editors that are provided for Unix. They have a fairly steep learning curve, but they are the right tools for the job if you're writing programs or looking at plain-text data. If you download sequence data from a web server and open and work with it in a plain-text editor, the file you write out should be readable by a sequence-analysis program. If you opened the same file and worked with it in a WYSIWYG editor, then wrote it out in the file format used by that editor, it would be unreadable by other programs.

The vi editor is a standard feature of most Unix systems. It's a full-screen editor; it allows you to see as many lines of the file that you are editing as will fit into the terminal screen or window in which you run it. The cursor can be moved through the file using keyed instructions, but it can't be moved with the mouse. The bottom line on the screen is called the status line. Error messages from vi appear in the status line.

In Section 5.6, we discuss the use of regular expressions for searching and replacement as a feature of the plain-text editor vi. The ability to use vi with the regular-expression language makes vi a powerful tool for file manipulation.

A few nice features have been added to vi in vim (vi improved). It's worth asking your system administrator to install vim if it's not already on your system, if only for the multiple undo feature that it introduces. We can't cover all the features of vim here, but we will present a few commands that will get you up and running.[‡]

vim has three modes; in each, input from the keyboard is interpreted differently:

Command

This is the main mode; you are automatically in command mode when you start working. Keystrokes are interpreted as vim's short commands, most of which consist of one or two letters. You can always return to the command mode by hitting the Escape key once (or sometimes twice).

Input

This mode is reached by issuing any command that requires input.

Status line

This mode is for issuing longer, more complex commands. To reach status line mode, simply type a colon (:) in command mode. A colon appears at the left side of the status line, and anything you type appears in the status line. When you finish typing your command and hit the Enter key, the command is executed, and you return to command mode.

Here are some of the most useful vim command-mode commands:

h, j, k, l

Moves the cursor around in your file character-by-character or line-by-line. It's sort of like a pre-joystick video game: "h" moves you to the left, "l" to the right, "j" moves you down a line, and "k" up a line. On most systems, the arrow keys on your keyboard will also work to move you around within vim.

w, b

Moves the cursor forward ("w") or back ("b") by one word in the text. Words are delimited by whitespace.

), (

Moves the cursor forward ")" or back "(" by one sentence in the text. Sentences are recognized as sequences of words terminated by an end-of-sentence character (. ? !).

a, A, i, I, o, O

Initiates the insertion of text. "a" and "A" insert text after the cursor and at the end of the current line, respectively. "i" and "I" insert text before the cursor and at the beginning of the current line. "o" and "O" open a blank line below or above the current line, respectively, and begin inserting text on the new line.

x, X

Deletes the text under the cursor or before the cursor, respectively. Preceded by an integer number, they delete that number of characters after or preceding the cursor.

s, S

Substitutes for the character under the cursor or for the current line, respectively, by deleting the character either under the cursor or the line and initiating insertion of text in place of the deleted character. Preceded by an integer number, "s" replaces that number of characters with the new text, and "S" replaces the specified number of lines.

Here are some of the most useful vim status line mode commands:

:wq

Saves changes to the file and quits the editing session. ":w" can be used by itself or with the name of the file to write to. ":q!" exits the session without saving changes.

:r

Followed by a filename, inserts the entire text of the named file.

:g/pattern/s//replacement/g

Searches for and replaces pattern with replacement throughout the buffer. If the trailing "g" is left off, only the first occurrence of the pattern in any line is replaced.

:number

Moves the cursor to the specified line number.
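For example, to replace every occurrence of the string meercat with mongoose throughout the file you're editing (hypothetical strings, of course), type:

:g/meercat/s//mongoose/g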

The GNU Emacs Editor

vim is a fairly flexible editor, and you can certainly learn to make it do any text-editing task that you need to do. However, there are other options for text editing on Unix systems. The best of these is probably the Emacs editor. Emacs is an editing program made available by the Free Software Foundation. It contains not only a text-editing facility with special modes for TeX and LaTeX documents, programs in various programming languages, and outlines, but also a file manager, mail and news readers, and access to the online documentation browser info. Whole books have been written on Emacs (see the Bibliography) so we won't go into it here except to recommend that, if you're working on a Unix system, learning to use Emacs is one of the better uses of your learning-curve time.

Viewing Binary Files with strings

Usage: strings -[options] filenames

In addition to the text files we've discussed up to now, there are also binary files that can't be read as text. They are almost always the output of a program or the executable form of a program itself (as opposed to the source code). Binary files and program executables aren't human-readable because they are in machine language. Because of this language gap, we'll unflinchingly make the prediction that, 9 times out of 10, it isn't worth the effort needed to read binaries. You'll have more luck taking another route, like talking to the person whose program created the file in the first place. Unfortunately, many programs today, such as commercial hidden Markov model software or data mining programs that directly write their internal representation of data structures to disk, use binary files to store proprietary data structures. For that tenth time, then, we present some tips on how to extract information from binaries without going crazy.

Your first step should be to use either the less command described earlier or the strings command. If any portions of the file are in plain text, they will be readable in less. The strings command cuts out any readable text characters in the file and prints them to the screen. For example, if you have an undocumented binary file named badger and want to see if it contains any clues as to what it does, try typing:

% strings -n 3 badger | less

(The -n option tells strings how many readable text characters in a row constitute a string. The default setting for -n is four). Piping the output to less will let you page through it if it's longer than one screen. If the output looks like:

ATCGTACTGATCGTCGATCGTCGATCATGCA
CGTAGCAGTCGATCATCATCGTACTAGCTAG
ATGCCTGAGCTATACACACTAGTCACGATGC

you might guess that badger contains some kind of binary encoding of data including a nucleotide sequence or a (not good) multiple sequence alignment.

od and Binary Data

Usage: od -[options] filenames

Sometimes, it may be necessary to do more than just identify a binary file. In these cases, the od program may provide a first step in understanding the file's contents. Before looking at od itself, let's take a quick detour through the ways in which binary information is represented in a moderately more human-friendly form.

Rather than using conventional decimal (base-10) notation, binary data is usually represented using a base that is a power of two: either octal (base-8) or hexadecimal (base-16) digits. Octal numbers are usually preceded by a 0. For example, the decimal number 25 corresponds to octal 031 (3 × 8 + 1 = 25).[§] Hexadecimal digits, on the other hand, are usually preceded by a 0x and use the letters A through F to represent the decimal numbers 10 through 15. The decimal number 25 is 0x19 in hexadecimal (1 × 16 + 9 = 25).

If you want to delve into the heart of the binary file and see what's going on, you can use the od command to perform an octal dump (or hex dump) and see if your binary file is readily interpretable. Typing:

% od -c badger | less

creates an octal dump of badger you can step through a page at a time. It should look something like this:

0000000       001           006   R   T   D   C   Y   G  
0000020     006   R   T   D   C   Y   G        a      
0000040 001   1       003   A   R   G       001      
0000060 002   C   A       002   B   Z 270   R   ? 200       @

od's primary options are:

-c

Prints out text characters corresponding to bytes

-x

Produces a hex dump of the file

-o

Produces an octal dump (the default setting)

-d

Produces a dump of unsigned decimal numbers

Unless you're a serious programmer, you're not likely to have to read binaries. However, on the off chance that you do, we hope these standard tools will help you start to get your questions answered.

Transformations and Filters

Filters are programs that take input data and transform it to produce output. They can accomplish tasks—such as extracting parts of files—that word processing and spreadsheet applications can't. A transformation involves a simple manipulation of the data format, or selection of specified lines or fields from the data. In this section, we discuss some of the more commonly used filters that are part of Unix. These filters can read from standard input and write to standard output, allowing you to combine them and produce fairly complex transformations.[‖]

Extracting the Beginning of a File with head

Usage: head -number files

Say you have a program that spits out a lengthy datafile that has several different tables of information concatenated together. Leaving aside the question of why anyone would write a program that creates such difficult output, there are commands that allow you to work with such data, and you need to know them. head is one such command.

By default, head sends the top 10 lines of the specified file or files to standard output. Checking the head of a file this way is an easy way to see if there's something in the file without opening it using an editor or doing a full cat of the file.

With the -number flag, head becomes a tool for selecting a specified number of records from the top of a file. Combinations of head and tail commands can extract any set of lines from a file provided that you know their location in the file.
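For example, assuming the combined meercat file built earlier in this chapter, the following command extracts lines 21 through 30: head passes along the first 30 lines, and tail keeps the last 10 of those:

% head -30 big-meercat.txt | tail -10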

Extracting the End of a File with tail

Usage: tail [-f] -number files

The tail command outputs the last 10 lines of a file by default, or the last number lines if a number is specified. With the -f option, tail gives constantly updated output of the last few lines in the file. This provides a way to monitor what is being added to a text output file as it's written by another program.
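For example, if a long-running program is writing its progress to a file named, say, blast_run.log (a hypothetical name), you can watch new lines appear as they are written, pressing Ctrl-C when you're done watching:

% tail -f blast_run.log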

Splitting Files with split and csplit

Usage: split -[options] filename
Usage: csplit -[options] file criteria

The split command allows you to break up an existing file into smaller files of a specified size. Each file produced is uniquely named with a suffix (aa, ab...az, ba, etc.). The options to split are:

-l lines

Splits the file into subfiles of length lines

-a length

Uses length letters to form suffixes

If you have a file called big-meercat.txt and you want to split it into subfiles of length 100 lines using single-letter suffixes and writing the files out to subfiles named meercat.*, the command form of the split command is:

% split -l 100 -a 1 big-meercat.txt meercat. 

csplit also splits files into subfiles, but is somewhat more flexible than split, because it allows the use of criteria other than number of lines or bytes for splitting. Here are csplit's options:

-f prefix

Uses the specified file prefix to form subfile names

-n length

Uses suffixes of a specified length to form subfile names; subfile suffixes are made up of numbers rather than letters

Split criteria are formed in two ways: either a regular expression is supplied as the criterion, possibly modified by an offset, or a number of lines can be specified.

A biological sequence database in FASTA format may contain many records of the form:

>identifying header information 
PROTEINORNUCLEICACIDSEQUENCEDATA 

The csplit command can split such a database into individual sequence files using the command:

% csplit -f dbrecord. -n 6 fastadbfile '/^>/' '{*}'

The quotes keep the shell from treating the > and brace characters as special, and the '{*}' argument (a GNU csplit feature) repeats the split at each header line, so the file is split into numbered subfiles, each containing a single sequence.

Separating File Components with cut

Usage: cut -c list filenames
or cut -f list -d delim -s file

The cut command outputs selected parts of each line of an input file. A line in a file is simply any stretch of characters that ends with a specific delimiter; a delimiter is a special nontext character an operating system or program recognizes. Lines in files are terminated with an EOL (end-of-line) character; files themselves are terminated with an EOF (end-of-file) character. These characters are usually invisible to you when you're working with the file, but they are important in how a file is treated by programs that read it.

For example, say you have a file called sequence_data that contains the following:

ATC  TAC
ATG  CCC
GAT  TCC

Here's how to use cut to output the first character of each line in the file:

% cut -c 1 sequence_data 
A
A
G

And here's how to output fields 1 and 2 of each line (assuming the two columns are separated by the tab character, cut's default field delimiter):

% cut -f 1-2 sequence_data 
ATC  TAC
ATG  CCC
GAT  TCC

Portions of each defined line can be selected by character number in the line with the -c option, or by field with the -f option. Fields are stretches of characters within a line that are defined by delimiters. The most obvious delimiter for use within the text of a file is simply the space character, but other characters can be used as well. Fields are different from columns, which are strictly defined by numbering each character in the input line.

The list argument specifies the range of each line, whether in characters or in fields, to be selected.

The list is in the form of single numbers or of two numbers separated by a - character. Multiple single columns or ranges can be selected by separating them with commas. Either the first or the last number can be omitted, indicating that the cut starts at the beginning of the line or that it ends at the end of the line. Characters and fields in each line are numbered starting at 1.

When the -f option is used, indicating that cut is to count fields rather than characters, a delimiter other than the default tab character can be specified with the -d option. The -s option causes cut to ignore lines that don't contain the specified delimiter. This option can be useful, for example, for ignoring header lines in a table.
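For example, given a hypothetical comma-delimited annotation file, the following command prints the second field of every line that actually contains a comma, silently skipping header lines that don't:

% cut -f 2 -d ',' -s annotations.csv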

Combining Files with paste

Usage: paste -[options] files

The paste command allows you to combine fields from several files into one larger file. Unlike the join command, which does a database-style merging of two files, paste is a purely mechanical combination of files. Lines are combined based solely on their line number in each file: i.e., the first line of file1 is pasted next to the first line of file2, regardless of the content of the lines. Pasted data is separated by a tab character unless another delimiter is specified with the -d option. With the -s option and only one input filename, paste joins all the lines in the input file into one long line.

paste can prepare datafiles to be read by data-analysis applications. If you have a group of files in the same format and you have used other filter commands to remove corresponding information from each of them, you can prepare one input file that allows you to plot the corresponding information from each of the files without reading them independently. In a previous example, we used piped commands to extract a column from a table in a complicated output file:

% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data 

If you have eight similar output files for proteins 1-8, you can process them all in the same way and then paste the results that you're interested in comparing into one big datafile:

% paste protein*.pka.data > allproteins.pka.data 

Each individual file in this example might look something like this:

3.8 
12.0 
10.8 
4.4 
4.0 
6.3 
7.9 

Each number represents the computed pKa value of one amino acid in a protein. If you have several sets of results that can be meaningfully combined into a table, paste creates a simple tab-delimited table that looks like this:

3.8 3.2 3.6 
12.0 12.9 12.5 
10.8 10.9 11.0 
4.4 4.2 4.5 
4.0 3.9 4.2 
6.3 6.5 6.2 
7.9 7.5 8.0 

It's up to you, however, to understand how your data can be meaningfully combined into a table and to use the paste command correctly to get the result you want.

Merging Datafiles with join

Usage: join -[options] file1 file2

join merges two files based on the contents of a specified join field, where lines from the two files having the same value in the join field are assumed to correspond. Files are assumed to have a tabular format consisting of fields and a consistent field separator, and are assumed to be sorted in increasing order of their join fields.

Command-line options for join include:

-1 fieldnum

Uses the specified field number as the join field in file 1

-2 fieldnum

Uses the specified field as the join field in file 2

-t character

Uses the specified character as the delimiter throughout the join operation

-e string

Replaces empty output fields with the specified string

-a filenum

Produces output for each unpairable line in the specified file; can be specified for both input files; fields belonging to the other output file are empty

-v filenum

Produces output only for unpairable lines in the specified file

-o list

Constructs the output lines from the list of specified fields, where the format of the field list is filenum.fieldnum; multiple items in the list can be separated by commas or whitespace

join is quite useful for constructing data tables from multiple files, and a sequence of join operations can construct a complicated file. In a simple example, there are three files:

mustelidae.color:
badger black 
ermine white 
long-tailed tan 
otter brown 
stoat tan

mustelidae.prey:
ermine mouse 
badger mole 
stoat vole 
otter fish 
long-tailed mouse

mustelidae.habitat:
river otter 
snowfield ermine 
prairie long-tailed 
forest badger 
plains stoat 

First, combine mustelidae.color and mustelidae.prey. The field the two files have in common is the name of the animal, which is the first field in each file. join expects its input files to be sorted on the join field; mustelidae.color already is, but mustelidae.prey isn't yet sorted. The form of the join command needed is:

% sort mustelidae.prey | join mustelidae.color - > outfile

which produces the following output:

badger black mole 
ermine white mouse 
long-tailed tan mouse 
otter brown fish 
stoat tan vole

Now combine the resulting file with mustelidae.habitat. If you want the resulting output to be in the form habitat animal prey color, use the command construct:

% sort -k2 mustelidae.habitat | join -1 2 -2 1 -o 1.1,2.1,2.3,2.2 - outfile

This operates on the standard input and the output file from the previous step to produce the output:

forest badger mole black 
snowfield ermine mouse white 
prairie long-tailed mouse tan 
river otter fish brown 
plains stoat vole tan

Sorting Files with sort

Usage: sort [general options] [-o outfile] [key interpretation options] [-t char] [-k keydef] ... [filenames]

The sort command can sort a single file, sort a group of files and simultaneously merge them into a single file, or check a file for sortedness. This function has many applications in data processing. Each line in the file is treated as a single field by default, but keys can also be defined by the user on the command line.

The main options for sort are:

-c

Tests a file for sortedness based on the user-selected options

-m

Merges several input files

-u

Displays only one instance of lines that compare as equal

-o outfile

Sends the output to a file instead of sending it to standard output

-t char

Uses the specified character to delimit fields

Options that determine how keys are interpreted can be used as global options, but they can also be used as flags on a particular key. The key interpretation options for sort are:

-b

Ignores leading or trailing whitespace in a sort key.

-r

Reverses the sort order for a particular key.

-d

Uses "dictionary order" in making comparisons; i.e., characters other than letters, digits, and whitespace are ignored.

-f

Reclassifies lowercase letters as uppercase for the purpose of making comparisons. Normally, L and l would be separated from each other due to being in uppercase and lowercase character sets; with the -f flag, all L's end up together, whether capitalized or not.

Specifying sort keys

Key definitions are arguments of the -k option. The form of a key definition is position1,position2. Each is a numerical value that specifies where within the line the key starts and ends. Positions can have the form field.character, where field specifies the field position in the input line, and character specifies the position of the starting character of the key within its individual field. If the key is flagged with one of the key interpretation options, the form of the key is field.character[flags]. If the key interpretation option isn't applied to the whole sort, but merely to one key, then it's appended to the key definition without a preceding hyphen.
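For example, to sort the outfile table built in the previous section on its third field (the prey), breaking ties on the animal name in field 1 in reverse order, you might type:

% sort -k 3,3 -k 1,1r outfile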

File Statistics and Comparisons

It's frequently useful to find out if two separate files are the same and, if not, where they have differences. For instance, if you have compiled a program on your local machine, and test cases are provided, you should run your copy of the program on the test cases and compare the output to the canonical output provided by the makers of the program. If you want to check that the backup copy of a file and the current version of the file are the same, file-comparison tools are very useful. Unix provides tools that allow you to do this without laboriously searching through the files by hand.

Comparing Files with cmp and diff

Usage: cmp -[options] file1 file2
Usage: diff -[options] file1 file2

Let's say you have two lists and, while they look similar, you can't tell by eye if they are exactly the same list. This can happen if you get a list of gene names back from database searches performed using two subtly different queries and want to know if they are equivalent. In order to compare them rigorously (and save your eyes in the process), you can try the semicomplementary commands cmp and diff. In short, cmp tells you whether two files are identical, and diff prints any lines that are different.

cmp is fairly simple-minded. Typing:

% cmp enolase1.list enolase2.list

produces no output if the two files are identical. Otherwise, cmp returns a message that the files differ and includes the character and line at which the first difference occurs.

diff is most useful for comparing different versions of a file to find exactly where the files differ. Before looking at diff's rather obtuse output, it's worth taking a moment to see how to decrypt it. Without options, diff responds with a list of differences in the form of the changes required to make file2 from file1:

x,y d i

Lines x through y in file1 are missing in file2 after line i (i.e., they've been deleted from file2).

i a x,y

Lines x through y in file2 are missing in file1 after line i (i.e., they've been added to file2).

i,j c x,y

Lines i through j in file1 have been changed to lines x through y in file2.

In practice, the output looks like this (where enolase1.txt and enolase2.txt are lists of names of putative enolases produced by two database searches performed at different times):

% diff enolase1.list enolase2.list 
1a2
> ENO_MESCR
5a7
> ENOA_MOUSE 

Here are two of the more immediately useful options diff uses:

-b

Ignores differences in whitespace between lines

-B

Ignores inserted or deleted blank lines between files

The info pages on diff and its variants are especially helpful. If you use this utility extensively, we strongly recommend you give them a look.

Counting Words with wc

Usage: wc -[options] filename(s)

wc is a simple and useful utility for counting things in text files. Given a text file, wc counts the number of lines, words, and bytes (characters) that it contains. The default setting for wc is to count all three entities, so that typing it at the command prompt returns a line that looks like:

% wc meercat1.txt 
   27   98   559 meercat1.txt 

This output tells you that there are 27 lines, 98 words, and 559 bytes in meercat1.txt. If you pass multiple files to wc, it returns counts both for individual files and for all of them combined. For example, if you run wc on the three meercat files:

% wc meercat1.txt meercat2.txt meercat3.txt

(or, to save time, wc meercat*.txt, being appropriately careful using the wildcard), the output looks like:

   41   130   905 meercat1.txt   
   50   124   869 meercat2.txt   
   10    19   156 meercat3.txt   
  101   273  1930 total 

These are the options for wc:

-c

Counts only bytes (characters)

-w

Counts only words

-l

Counts only lines

--help

Prints a usage message

--version

Prints the version of wc being used

Unix tools can often be used in combination to collect information you need. For instance, say you have a list of 1,000 files that need to be processed, and the output files are all saved together in the same directory. Instead of trying to list the contents of that directory using ls, you can use ls -1 dirname | wc -l to count how many output files have been created so far.

The Language of Regular Expressions

The pattern-matching language known as regular expressions allows you to search for and extract matches and to replace patterns of characters in files (given the right program). Regular expressions are used in the vi and Emacs text-editing programs. Since much of the data that biologists work with contains patterns, one of the first skills you need to learn is how to match patterns and extract them from files.

Regular expressions also are understood by the Perl language interpreter. Knowing how to use regular expressions along with the basic commands of Perl gives you a powerful set of data-processing tools. We'll cover the basics of regular expressions here, and return to them again in Chapter 12.

If you've ever used a wildcard character in a search, you've used a regular expression. Regular expressions are patterns of text to be matched. There are also special characters that can be used in regular expressions to stand for variable patterns, which means you can search for partial or inexact matches. Regular expressions can consist of any combination of explicit text and special characters.

The special characters recognized in basic regular expressions are:

\

The backslash acts as an escape character for a special character that follows it. If part of the pattern you are searching for is a dot, you give the regular expression chars\.txt to find the pattern chars.txt.

.

The dot matches any single character.

*

The behavior of the asterisk in regular expressions is different from its behavior as a shell wildcard. If preceded by a character, it matches zero or more occurrences of that character. If preceded by a character class description, it matches zero or more characters from that set. If preceded by a dot, it matches zero or more arbitrary characters, which is equivalent to its behavior in the shell.

^

The caret at the beginning of a regular expression matches the beginning of a line. Otherwise, it matches itself.

$

The dollar sign at the end of a regular expression matches the end of a line. Otherwise, it matches itself.

[charset]

A group of characters enclosed in square brackets matches any single character within the brackets. [badger] matches any of (a, b, d, e, g, r). Within the set, only -, caret, ], and [ are special. All other characters, including the general special characters, match themselves. A range of characters in the form [c1-c2] can also be given; e.g., [0-9] or [A-Z].
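Putting a few of these together: the following pattern (a hypothetical one, in the spirit of the FASTA files discussed elsewhere in this chapter) matches any line that begins with the > character, followed by zero or more arbitrary characters, followed by the word enolase:

^>.*enolase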

Searching for Patterns with grep

Usage: grep -[options] 'pattern' filenames

grep allows you to search for patterns (in the form of regular expressions) in a file or a group of files. GNU grep (the standard on Linux) searches for one of three kinds of patterns, depending on which of the following functions is selected:

-G

Standard grep: searches for a regular expression (this is the default)

-E

Extended grep: searches for an extended regular expression

-F

Fast grep: rapidly searches for a fixed string (a pattern made of normal characters, as opposed to regular expressions)

Note that the -E and -F options can be explicitly selected by calling egrep or fgrep on some systems. If no files are specified to be searched, grep searches the standard input for the pattern, allowing the output of another program to be redirected to grep if you are looking for a pattern in the output.

As a simple example, consider the following commands:

% grep -c '>' SP-caspases-A.fasta SP-caspases-B.fasta 
% grep '>' SP-caspases-A.fasta SP-caspases-B.fasta

These both search through a file of FASTA-formatted sequences (whose header lines, you will remember, begin with the > symbol). The first command returns the number of sequences in each file, while the second returns a list of the sequence headers. Be sure to enclose the > in quotes, though. Otherwise, as one of us once found out the hard way, the command is interpreted as a request for grep to search the standard input for no pattern and then redirect the resulting empty string to the files listed, overwriting whatever was already there.

grep takes dozens of options. Here are some of the more useful ones:

-c

Prints only a count of matching lines, rather than printing the matching lines themselves

-i

Ignores uppercase/lowercase distinctions in both file and pattern

-n

Prints lines and line numbers for each occurrence of a pattern match

-l

Prints filenames containing matches to pattern, but not matching lines

-h

Prints matching lines but not filenames (the opposite of -l )

-v

Prints only those lines that don't contain a match with pattern

-q

(quiet mode) Suppresses output and stops searching after the first match; useful when you only need to know whether a pattern is present
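Combining two of these options lets you count the lines in a FASTA file that aren't sequence headers; the filename here is the one from the example above:

% grep -vc '^>' SP-caspases-A.fasta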

In protein structure files, protein sequence information is stored as a sequence of three-letter codes, rather than in the more compact single-letter code format. It's sometimes necessary to extract sequence information from protein structure files. In real life, you can do this with a simple Perl program and then go on to translate the sequence into single-letter code. But you can also extract the sequence with two simple Unix filter commands.

The first step is to find the SEQRES records in the PDB file. This is done using the grep command:

% grep SEQRES pdbfile > seqres

This gives you a file called seqres containing records that look like this:

SEQRES 1 357 GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN 2MNR 106 
SEQRES 2 357 VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR 2MNR 107 
SEQRES 3 357 VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR 2MNR 108

Not all the characters in each record belong to the amino-acid sequence. Next, you need to extract the sequences from the records. This can be done using the cut command:

% cut -c20-70 seqres > seqs

The output of this command, in the file seqs, looks like this:

GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN 
VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR 
VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR

If you don't want to create the intermediate file, you can pipe the commands together into one command line:

% grep SEQRES pdbfile | cut -c20-70 | paste -s > seqs

Addition of the paste -s command joins the individual lines in the file into one long line.

Unix Shell Scripts

The various Unix shells also provide a mechanism for writing multistep scripts that let you automate your work. Scripts are labeled as such because they contain, verbatim, the sequence of commands you want to "say" to the shell, just as the script for a play contains the sequence of lines the author wants the actors to say.

Shell scripts—even the simplest ones—are still applications, and they behave accordingly. Let's say you want to start a series of calculations that will take a while, and then go home to eat dinner. By default, the shell will wait until one command is finished to execute the next command, so if the second command acts upon the output of the first, it won't start prematurely. The important thing is that you don't have to be there to type the second command.

Here's a relatively simple example. Assume you have just downloaded the entire set of GenBank DNA sequence files. You want the information in the files, but you need it to be in a different format so that a program you've downloaded can process it. You're going to use the program gb2fasta to convert the files from GenBank to FASTA format. (This script assumes you've downloaded the GenBank files to your current working directory.) Then you want to process each file using the BLAST formatdb program. To make the script more flexible, you can write it so that it takes an optional file list on the command line to specify which files to process. The script might look like this:

#!/usr/bin/csh
foreach file ($*) 
  echo $file 
  gb2fasta $file > $file.na 
  formatdb -t "$file" -i $file.na -p F 
end 

After creating the file, you need to make it executable using the chmod command. For instance, if the filename of the script is blastprep, give the command:

% chmod a+x blastprep

The first line of the script tells the operating system which shell program to use, the shell is invoked, and the job is run. You can invoke your command immediately in the following way:

% ./blastprep gbest*.seq 

To run the new script by name alone, without the ./ path prefix, you first need to run the rehash command. rehash is a C-shell command that updates the shell's list of all the executable files in your path.

In the previous example, all the GenBank EST files are automatically parsed and prepared for use with BLAST. The programs gb2fasta and formatdb run just as they do on the command line, but you don't have to type each command and wait for it to complete before issuing the next. The script takes your command-line argument (in this case gbest*.seq, which the shell expands to a list of filenames) and sequentially fills the variable $file with each value, executing the lines between the foreach and end lines once for each file. The echo command simply sends the value of $file to standard output, so you can see in your terminal window how the job is progressing. The gb2fasta program normally prints to standard output, so you need to redirect its output to a specific filename. On the other hand, formatdb processes the input files and generates new files using an internal naming convention, so no output redirection is needed in the script.
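
Incidentally, if you want to start a long run like this and then head home for dinner, one approach (a sketch; the log filename is arbitrary) is to detach the job from your terminal with nohup and put it in the background:

% nohup ./blastprep gbest*.seq >& blastprep.log &

The job keeps running after you log out, and any output that would have gone to your terminal ends up in blastprep.log.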

Communicating with Other Computers

As we'll see in Chapter 6, the ability to plug into other computers and networks across the world allows you to read and download an amazing amount of information, as well as share data with your colleagues. In fact, your work as a bioinformatician depends on having access to public databases and other repositories of biological data. In this section, we look at how your computer communicates with other machines and the tools it uses to do so.

The Web

The easiest way to communicate with other computers is via the Web. Most distributions of Linux include web browser software—usually Netscape—which, if you select it from the list of installation options, is automatically installed for you. Setting up a web browser on a Linux system is the same as setting up a browser on other computers; you need to set the browser's preferences and tell it where the correct utilities are located to open different kinds of file attachments.

You may want to maintain a web page on your machine, and in order to do that, you need to install web server software. Again, most Linux distributions allow you to install the Apache web server software as one of your installation options. If you choose to install the Apache web server, you can publish a simple web site by placing the appropriate HTML files in the /home/httpd/html directory.

IP Addresses and Hostnames

In the world of the Internet, computers recognize each other by their Internet Protocol (IP) addresses. Computers that are constantly connected to the Internet have permanently allocated IP addresses and hostnames, while computers that only connect to the Internet occasionally may have dynamically allocated IP addresses, or no IP address at all, depending on the protocol they use to connect.

IP addresses consist of four numbers separated by dots (e.g., 128.174.55.33). These are interpreted as directions to the host (a computer that communicates with other computers) by network software. Computers also have hostnames, such as gibas.biotech.vt.edu. Name servers are dedicated machines that maintain information about the relationships among IP addresses and hostnames.
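
If you're curious about the name-to-address mapping for a particular host, the nslookup command asks a name server directly. For example (using the hostname mentioned above):

% nslookup gibas.biotech.vt.edu

The output lists the name server that answered the query and the IP address it reports for the host.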

telnet

Usage: telnet full.hostname

The telnet command opens a shell on a remote Unix machine; the workstation on which the command is issued becomes a terminal for that machine. To telnet to another Unix machine, you must have a login on that machine. Once you're logged in to the remote host, the shell works just as if you were working directly on the remote machine.[#]

A "login:" prompt should appear, followed by a "password:" prompt after your ID is entered.

ftp

Usage: ftp full.host.name.edu

The File Transfer Protocol (ftp) is a method for transferring files from one computer to another. You may be familiar with Fetch, Interarchy, or other desktop FTP clients; Unix ftp is conceptually similar to these programs (and many of them have analogs that run under Linux, if you like their graphical user interfaces). When you use ftp to connect to another host, you find yourself in an operating environment that is unique to ftp. Unix commands don't always work in the ftp environment, although the commands ls and cd have similar functions.

Again, a "login:" prompt appears, followed by a "password:" prompt. If you are accessing an anonymous FTP server (a common way to distribute software), the standard username is anonymous, and your email address is the password. Once in the FTP environment, the most important commands to know are:

help

Prints out the list of ftp commands. help command prints out information on a specific command.

ls

Lists the contents of the directory on the remote host.

cd

Changes the working directory on the remote host.

lcd

Changes the working directory on the local host.

get, mget

get copies a single file from the remote host to the local host. mget copies multiple files.

put, mput

put copies a single file from the local host to the remote host. mput copies multiple files.

binary, ascii

Changes the file-transfer mode to binary or ASCII. You should choose binary when you are downloading binary executables, images, and other encoded file formats.

prompt

Toggles the interactive mode that asks you to confirm every transfer when you transfer multiple files.
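
Putting these commands together, a short anonymous FTP session might look like this (the hostname, directory, and filenames here are hypothetical):

% ftp ftp.wherever.edu
Name: anonymous
Password: jambeck@wherever.edu
ftp> cd /pub/data
ftp> binary
ftp> prompt
ftp> mget *.tar.Z
ftp> quit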

Displaying from a Remote Terminal

Sometimes you need to run an X program on another computer and have it display on your terminal. This is relatively simple to do. First, you need to set your own terminal to allow remote displays from other hosts. This is done using the xhost command:

% xhost + 

A confirmation that access is allowed from other hosts is then printed to standard output. Note that xhost + allows any host to connect to your display; if that worries you, you can grant access to a single machine instead, as in xhost +remotemachine.wherever.edu.

Next, you need to change the display environment on the remote machine. This is done with the setenv command:

% setenv DISPLAY yourmachine.yoursubnet.wherever.edu:0 
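
Putting the two steps together, a session might look like this (the hostnames are hypothetical, and xclock is just a convenient test program):

% telnet remotemachine.wherever.edu
(log in as usual)
% setenv DISPLAY yourmachine.yoursubnet.wherever.edu:0
% xclock &

The xclock window should then appear on your local screen.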

Not all X applications running on a remote server can use your terminal for display, generally because the remote machine and your machine don't have the same graphics capabilities. For instance, programs running on a remote Silicon Graphics machine can't display on your local Linux workstation, because Silicon Graphics uses proprietary graphics libraries that aren't currently available to Linux users. However, even if both machines are compatible, bandwidth limitations can make running large X programs over the network extremely slow.

Communication and File Sharing

One of the biggest inconveniences for Linux users in a primarily Mac/PC environment is the sharing of files generated by PC productivity software with other users. While it's not our purpose to teach you to use these packages here, we can mention a few options that will help you handle communication with non-Unix users.

Fortunately, there are relatively low-cost software products available for Linux that make it possible to work with common file types, such as Microsoft Word and rich-text format (RTF) documents, PowerPoint presentations, and Excel spreadsheets. Sun's StarOffice (http://www.staroffice.com) and Applix's Applixware (http://www.vistasource.com) are two possibilities; at the time of this writing, StarOffice seemed to do the cleanest job of converting files generated by Microsoft Word and other commonly used programs. Adding one of these packages to your Linux system will add most of the basic PC functions (word processing, electronic presentations, etc.) that may be vital to your work.

Most kinds of graphics files are easily handled and converted on Linux systems. One powerful tool for manipulating graphics files is the GIMP (GNU Image Manipulation Program, http://www.gimp.org). The GIMP is commonly included in Linux distributions, so be sure to select it as part of your installation if you will be doing anything with graphics files. The GIMP is analogous to Adobe's Photoshop program and shares much of the same functionality.

Media Compatibility

Linux users can read and write files on Microsoft-formatted floppy disks and Zip disks. A floppy or Zip disk is treated as an additional filesystem on your computer. The most basic way to access this filesystem is to mount it using the mount command. To do this, you need to know the device ID of the disk you are trying to mount and establish a mount point for the new filesystem.

Determining the device IDs of the various drives is usually straightforward. One way is to open the file /var/log/dmesg. This file contains the system information that is printed to standard output when the machine is booted. Scan through the file and find the drive information, which should look like this:

hdc: SAMSUNG SC-140B, ATAPI CDROM drive 
hdd: IOMEGA ZIP 250 ATAPI, ATAPI FLOPPY drive 
hdc: ATAPI 40X CD-ROM drive, 128KB Cache 
Floppy drive(s): fd0 is 1.44M 

This section of the file contains information about IDE devices. On this particular machine, the IDE devices include a CD-ROM drive, a Zip drive, and a floppy drive. The three-letter codes hdc, hdd, and fd0 are the device IDs.

The next section of the file contains information about SCSI devices. On this particular machine, the main hard disk is a SCSI drive, and its ID is sda. sda1, sda2, etc., are the individual IDs of the partitions on the hard drive:

Detected scsi disk sda at scsi0, channel 0, id 0, lun 0 SCSI device 
sda: hdwr sector= 512 bytes. Sectors= 35566499 [17366 MB] [17.4 GB] 
sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 >

Accessing Devices as Unix Filesystems

Once you know the device IDs, mounting these new filesystems is simple. If you're the root user of your own machine, the command is:

mount -t [filesystem type] devicefile mount-point

For example, to mount a PC-formatted floppy disk at /mnt/floppy, the command is:

% mount -t msdos /dev/fd0 /mnt/floppy

You can find a listing of allowed file types in the manpages for mount.

As a shortcut, you can modify your /etc/fstab file to contain the following lines:

/dev/fd0         /mnt/floppy       vfat  noauto,owner  0 0 
/dev/hdd4        /mnt/zip          vfat  noauto,owner  0 0 

On this system, the Zip drive is located at /dev/hdd. All PC-formatted Zip disks use partition number 4, and the device file for that partition is /dev/hdd4. The noauto flag means that these disks aren't mounted automatically at boot time. Once these lines are added to /etc/fstab, the devices can be mounted with a shortened command that names either the device or the mount point, such as mount /dev/fd0 or mount /mnt/floppy.

Once the Zip or floppy is mounted as a partition, the files on that disk can be treated like any other file on the system.
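
As a quick sketch, copying a file (the filename is hypothetical) to a floppy with the /etc/fstab entries above in place looks like this:

% mount /mnt/floppy
% cp results.txt /mnt/floppy
% umount /mnt/floppy

Remember to unmount the disk with umount before ejecting it, so any buffered data is actually written out.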

Getting some of these devices working isn't as straightforward as we'd like it to be. For further help, you can search the Web for the Linux how-to pages for the particular device you're using.

Accessing Devices as DOS Disks

If you install the utility package mtools and its graphical frontend mfm, you can run mfm and move files to Zip or floppy disks, using a graphical interface similar to that on a PC. However, if you use this method to access devices, you can't run Unix commands on the files stored on your media until you move them onto the local hard disk.
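
The mtools package also provides command-line utilities, such as mdir and mcopy, that address a DOS-formatted disk directly without mounting it. A brief sketch (the filename is hypothetical; a: refers to the first floppy drive):

% mdir a:
% mcopy a:results.txt .
% mcopy results.txt a: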

By default, processes to access media may be run only by the root user. It's possible to configure your system so that other users can write to floppy and Zip drives. However, this creates a security hole in your system. You have to decide for yourself whether the benefits of easy disk access outweigh any potential risks.

Playing Nicely with Others in a Shared Environment

Unix environments traditionally have been multiuser environments. While the availability of new flavors of Unix for personal computers might change this on your computer at home, at work you will probably use a shared, networked Unix system at least some of the time. And even on a personal Unix system, you need to be aware of problems that can arise when you create an excessive load on your system, and of how background processes can interfere with your ability to run interactive processes.

Because the Unix operating system can interact with more than one user at a time, from terminals attached directly to the system or over a network, there can be many processes executing on your system. Some processes will be yours, and others will belong to users who may be working across the room from you or hundreds of miles away. To be a good citizen in a Unix environment, you need to share the system's resources. While administrators of large public systems make it nearly impossible for you to be a bad citizen by implementing quotas for space usage and queueing systems for process management, it isn't likely that all systems you use will be so tightly managed. On shared systems in which good faith is all that's keeping users from stepping on each other's toes, it's wise to manage your own processes responsibly. Otherwise someone's going to come gunning for you, and it won't be pretty.

Processes and Process Management

A Unix system carries out many different operations at the same time. Each of these operations, or processes, has a unique process ID that allows the user or the administrator to interact with that process directly.

A minimum number of processes runs on a system regardless of whether you actively initiate them. Each shell program, whether idle or active, has a process ID attached to it. Several system (or root) processes, sometimes known as daemons, are constantly active on the system. These processes often lie in wait for you to initiate some activity they handle: for instance, printing files, sending email, or initiating a telnet session.

Above and beyond this minimal system activity level are any processes you initiate by typing a command or running a program. The Unix kernel manages these processes, allocating whatever resources are available to the processes according to their needs.

Each process uses a percentage of the processing capacity of the system's CPU or CPUs, as well as a percentage of the system's memory. When the processes running on a machine require more than 100% of the CPU's capacity, each individual process executes more slowly. Unix does an extremely good job of juggling hundreds of concurrent processes without having the machine roll over and die, but eventually the load can increase to the point that the machine becomes useless. The operating system uses many techniques to prevent this, such as limiting the absolute number of processes that can be started and swapping idle jobs out of memory. Even on a single-processor system, multiple processes can run concurrently as long as there is enough memory for the jobs to remain resident. At the point at which the CPU has to wait constantly for data to be loaded from the swap space on the hard drive, you will see a great drop in efficiency; this can be monitored using the top command, described in Section 5.9.1.3. Many machines are more limited by lack of memory than by a slow CPU, and it's often more cost-effective to put money into additional RAM than to buy the latest, greatest, and fastest CPU.

Checking the load average

Usage: w

The w command is available on most Unix systems. This command can show you which other users are logged into the system and what they are doing. It also shows the current load average on the system.

The standard output of the w command looks like this:

2:55pm up 37 days, 4:50, 4 users, load average: 1.00, 1.02, 2.00 
USER    TTY    FROM        LOGIN@    IDLE   JCPU   PCPU WHAT 
jambeck tty1              22Jan99  37days  3:55m  0.06s startx 
jambeck ttyp0  :0.0       Wed 5pm   1:34m  0.22s  0.22s -csh 
jambeck ttyp3  :0.0       21Feb99    3:47  9.05s  8.51s telnet weasel 
god     ttyp2  around      2:52pm   0.00s  0.55s  0.09s create world 

The first line of the output is the header. It shows the time of day, how long the machine has been up, how many users are logged in, and what the load average on the system has been for the last 1 minute, 5 minutes, and 15 minutes. The load average represents the fractional processor use on the system. If you have a single processor system and a load average of 1, the system is being used at optimal capacity. A four-processor system with a load average of 2 is being used at only half of its capacity. If you log in to a system and it's already being used at or beyond its capacity, it's not polite to add other processes that will start running right away. The batch or at commands can set up a process to start when resources become available.

The information displayed for each user is the username, the tty name, the remote host from which the user is logged in, the login time, the idle time, the JCPU and PCPU times, and what the user is doing.

Listing processes with ps

Usage: ps [ options ]

ps produces a snapshot of the processes running at the moment you issue the command. Depending on what your computer is doing at the time, typing ps at the prompt should give output along the lines of:

  PID TTY    TIME CMD
36758 ttyq10 0:02 tcsh    
43472 ttyq10 0:00 ps    
42948 ttyq10 4:24 xemacs-20    
42967 ttyq10 1:21 fermats-last-theorem-solver 

Most of ps's options modify the types of processes on which ps reports and the way in which it reports them. Here are some of the more useful options:

a

Lists every command running on the computer, including those of other users

l

Produces a long listing of processes (process memory size, user ID, etc.)

f

Lists processes in a "tree" form, showing related processes

Notice that you don't need to precede the option with a dash. There are actually a couple of dozen options for ps; check info ps to see which options are supported by your local installation.

top

Usage: top -[ options ]

The top command provides real-time monitoring of processor activity. It lists processes on the system, sorted by CPU usage, memory usage, or runtime. The top screen looks like this:

4:34pm up 37 days, 6:29, 4 users, load average: 0.25, 0.07, 0.02 
42 processes: 39 sleeping, 3 running, 0 zombie, 0 stopped 
CPU states: 42.9% user, 6.4% system, 0.0% nice, 51.0% idle 
Mem:  39092K av, 38332K used,   760K free, 13568K shrd,  212K buff 
Swap: 33228K av, 20236K used, 12992K free               8008K cached

  PID USER    PRI NI  SIZE  RSS SHARE STAT LIB %CPU %MEM   TIME COMMAND
  516 jambeck  15  0  4820 3884  1544 R      0 30.4  9.9   4:23 emacs-fgyell
  415 root      9  0 10256 9340   888 R      0 15.5 23.8 161:41 /usr/X11R6/b 
10756 cgibas    5  0   716  716   556 R      0  2.3  1.8   0:01 top-ci

The header is similar to the output of w but more detailed. It gives a breakdown of CPU and memory usage in addition to uptime and load averages. The display can be changed to show a variety of fields. The default configuration of top is set in the user's .toprc file or in a systemwide /etc/toprc file.

Here are the top options:

-d delay

Updates the display every delay seconds

-q

Refreshes without any delay, running at the highest possible priority

-s

Runs in secure mode, with its most potentially dangerous commands disabled

-c

Prints the full command line instead of just the command you're running

-i

Ignores all processes except those currently running

While top is running, certain interactive commands can be entered, unless they are disabled from the command line. The command i toggles the display between showing all processes and showing just the processes currently running. k kills a process. It prompts you for the process ID of the process to kill and the signal to send to it. Signal 15 is a normal kill; signal 9 is a swift and deadly kill that can't be ignored by the process. r changes the running priority of a process, implementing the renice command discussed in Section 5.9.1.5. It prompts you for the process ID and the new priority value for the job.

Signaling processes with kill

Usage: kill [-s signal | -p ] [ -a ] PID

The kill command lets you terminate a process abnormally from the command line. While kill can actually send various types of signals to a process, in practice it's most often used in the form kill PID or, if that fails to kill the process, kill -9 PID.

On most systems, kill -l lists the available types of signals that can be sent to a process. It's sometimes useful to know that jobs can be stopped and restarted with kill -s STOP and kill -s CONT. [**]

A PID is usually just the numerical process ID, which you can find with the ps or top commands. It can also be a process name, in which case a group of similarly named processes is addressed. Another useful form is a negative process group ID (written -n, where n is the group ID), which lets the kill command address all the processes in that group simultaneously.
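
For example, to terminate the long-running job from the earlier ps listing, and to escalate if it ignores the polite request:

% kill 42967
% kill -9 42967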

Setting process priorities with nice and renice

Usage: nice -n [ val command arg ]
Usage: renice -n [ incr ] [-g|-p|-u] id

Processes initiated on a Unix system run at the maximum allowed priority unless you tell them to do otherwise. The nice and renice commands allow the owner of a process, or the superuser, to lower the priority of a job.

If limited computing resources are shared among many users and computers are used simultaneously for computation and interactive work, it's polite to run background jobs (jobs that run on the machine without any interactive interface) with a low priority. Otherwise, interactive jobs such as text editing or graphical-display programs run extremely slowly while background jobs hog the available resources. Jobs running at a low priority are slowed only if higher-priority processes are running. When the load on the system is low, background jobs with low priority expand to use all the available resources.

You can initiate a command at a low priority using nice. n is the priority value; on most systems it is set to 10 by default and can range from 1 to 19 (or 0 to 20, depending on the system). The larger the number, the lower the priority of the job.

The renice command allows you to reset the priority of a process that's already running. incr is a value added to the current priority. Thus, if you have a background process running at normal priority (priority 1) and you want to lower its priority, you can enter renice -n 18 -p processID to increase the priority value to 19. You can also give a negative number to move a job to a higher priority, but unless you are root, you can't raise a priority beyond 1. The renice options -p, -g, and -u cause renice to interpret id as a process ID, a process group ID, or a user, respectively.
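
As a sketch, here's how you might start the blastprep script from earlier at the lowest priority, and then lower the priority of a process that's already running (using the PID from the ps example):

% nice -n 19 ./blastprep gbest*.seq
% renice -n 18 -p 42967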

Scheduling Recurring Activities with cron

The cron daemon, crond, is a standard Unix process that performs recurring jobs for the system and individual users. System activities such as cleanup of the /tmp directory and system backups are typically functions controlled by the cron daemon. Normal users can also submit their own jobs to cron, assuming they have permission to run cron jobs. Details about cron permissions are found in the crontab manpage. Since the at and batch commands, which are discussed later, are also controlled by cron, most systems are configured to allow users to use cron by default.

Submitting jobs to cron using crontab

Usage: crontab -[ options ] file

Submission of jobs to cron is done using the crontab command. crontab -l > file places the current contents of your crontab into a file so you can edit the list. crontab file sends the newly edited file back and initializes it for use by cron. crontab -r deletes your current crontab.
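
In practice, the editing cycle looks like this (mycron is an arbitrary filename):

% crontab -l > mycron
% vi mycron
% crontab mycron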

cron processes the contents of all crontabs and then initiates jobs as scheduled. A crontab entry, as produced by crontab -l, looks like this:

# Format of lines: 
#min  hour  daymo  month  daywk  cmd 
   50     2      *      *      *  /home/jambeck/runme 

This entry runs the program runme at 2:50 A.M. every day. An asterisk in any field means "perform this function every time." In this entry, all output to either STDOUT or STDERR is mailed to user jambeck's email account on the machine where the cron job ran.

Using cron to schedule a recurrent database search

What if your group performs DNA sequencing on a daily basis, and you want to use the sequence-alignment program BLAST to compare your sequences automatically against a nonredundant protein database? Consider this crontab entry (a single entry, wrapped here to fit on the page):

01 4 * * * find /data/seq/ -name "*.seq" -type f -mtime -1 -exec 
/usr/bin/csh /usr/local/bin/blastall -p blastx -d nr -i '{}' ';'

This automatically runs at 4:01 A.M. and checks for all sequences that have been modified or added to the database in the last 24 hours. It then runs the BLASTX program to search your copy of the nonredundant protein sequence database for matches to your new sequences and mails you the results. This example assumes you have all the necessary environment variables set up correctly so that BLAST can find the necessary scoring matrixes and databases. It also uses a default parameter set, which may need to be modified to get useful results. Once you get it configured correctly, all you have to do is browse through your email while you drink your morning coffee.

Scheduling processes with batch and at

Usage: at -[ options ] time
Usage: batch -[ options ]

The batch and at commands are standard Unix functions and are commonly available on most systems. Jobs are submitted to queues, and the queues are processed by the cron daemon; jobs are governed by the same restrictions as crontab submissions. The batch command assigns priorities to jobs running on the system. Using batch allows a system administrator to sort jobs by priority—high to low—thereby allowing more important jobs to run first. Unless the system has a mechanism to kill interactive jobs that exceed a specified time limit, this use of the batch queue relies on users to work in a cooperative manner. On larger systems the function of batch is usually replaced by more complicated queuing systems. You need to get information from your system administrator about which batch and at queues are available.

at allows you to submit a job to run at some specified time. batch sequentially runs jobs whenever the machine load drops below a specified level and the maximum number of concurrent batch jobs hasn't been reached. Once you initiate at or batch, all command-line entries are considered part of the job until you terminate the submission with a Ctrl-D keystroke. As with cron, any STDOUT and STDERR generated by the job is mailed to you, so you at least get notified of error conditions. Here are the common options:

-q queuename

Specifies the queue. By default, at uses the "a" queue; batch uses the "b" queue.

-l

Causes at to list the jobs currently in the specified queue.

-d jobid

Tells at to delete a specified job.

-f filename

Instructs at to run the job from a file rather than standard input.

-m

Instructs batch and at to send mail upon completion, even when no output is generated.

time

Time can be now, teatime, 7:00 P.M., 7:00 P.M. tomorrow, etc. Check the manpage for more details.

As an example, let's say you want your boss to think you were slaving away at 3:00 A.M. Simply send her mail at 3:07 A.M. Even if you don't plan on being awake, it's no problem. At the shell prompt, just type:

% at 3:07am
Mail -s "big breakthrough" boss@wherever < /home/jambeck/news
<Ctrl-d>

Monitoring Space Usage and File Sizes

As fast as available disk space on a system expands, users seem to be able to expand their files to fill it. Software takes up more space; output files become larger and more complex; more layers of analysis can be created. Since the infinitely large data-storage medium has yet to be invented, you can still run up against disk-space limitations. So, you need to be able to monitor how much space you are using and, as we'll discuss in Section 5.9.4, how to make data archives and store them on appropriate media.

Checking disk usage with du

Usage: du -[ options ] filenames

du reports the number of disk blocks used by the specified file or files. Without a filename, it reports disk usage for all files in the current working directory. The -s flag causes du to report a single total for each named file or directory, rather than listing every subdirectory separately.
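
For example, to see a single total for the space consumed by a project directory, you might type (the -k option, supported on most systems, reports kilobytes rather than blocks):

% du -sk otter/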

Checking for free disk space with df

Usage: df

df reports free disk space for local and networked filesystems on your computer. df is a useful way to find out which filesystems are mounted on your computer. If a connection to a filesystem you would expect to find is down, that filesystem doesn't appear in the df output. The df output looks like this:

Filesystem            Type    blocks       use    avail %use Mounted on 
/dev/root              xfs  17506496  14113592  3392904   81 / 
/dev/xlv/xlv_raid_home xfs  62783488  39506328 23277160   63 /scratch-res1 
/dev/dsk/dks0d5s0      xfs  17259688  15000528  2259160   87 /mnt/root-6.4 
/dev/dsk/dks12d1s7     xfs  17711872  11773568  5938304   67 /ip410 
/mnt/local/jmd/balder: NFS server balder not responding 
zeus:/hamr             nfs   2205816    703280  1502536   32 /nfs/zeus/hamr 
zeus:/hamrscr          nfs   4058200   2153480  1904720   54 /nfs/zeus/hamrscr 
zeus:/lcascr1          nfs 142241472 103956480 38284992   74 /nfs/zeus/lcascr1 

The first column is the actual location of the filesystem. In this case, locations beginning with / are local, and those beginning with a machine name (e.g., zeus:/hamr) are physically part of another machine. The second column shows the filesystem type; here, the nfs entries are remote filesystems, mounted over the network and connected to your computer by the NFS protocol. The next three columns show how many blocks the filesystem contains, how many are in use, and how many are still available, followed by the percent use of each device. The final column shows the local path to the filesystem.

It's useful to know these things if you are working on a system that is made up of multiple networked machines. From time to time connections are lost, like that to balder in the previous example. You may log in to a machine that can't find your home directory because an NFS connection is down. At these times, it's useful to be able to figure out what the problem is so you can send a concise and helpful email to the system administrator rather than just saying "help! My home directory is missing."

Checking your compliance with system quotas with quota

On some Unix systems, especially those that provide services to many users, system administrators implement disk space quotas for each user. The consequences of exceeding a disk space quota may be obvious. You might find that you're unable to write files or that you are automatically prompted to delete files each time you log in. Or, the consequences may be silent, but very annoying. For instance, if you exceed a quota, you may be able to run a text editor, only to find that it has overwritten the file you were editing with a file of length zero. Or your older files may simply start to be deleted as space is needed by other users.

If you're paying for computer time on a shared system, it's in your interest to find out what the user quota for the system is, for how long you can exceed it, what will happen if you exceed it, and where and how you can archive your files.

The quota command gives basic information about space usage and quota limits on systems with quotas. On most Unix systems, issuing the command quota -v gives space use information even when user disk quotas haven't been exceeded.

Creating Archives of Your Data

So, after months of your time, hundreds of megabytes of files, and several layers of subdirectories, the otter project is finally complete. Time to move on to the next project with a clean slate. But as refreshing as it may sound, you can't just type:

% rm -rf otter/

Other people may need to look back at your findings or use them as a starting point for their own research. At the other extreme, you can't leave your files lying around or laboriously copy them a few at a time to another location. Not every file needs to be accessible at all times; some files are replaced, while others are more conveniently stored elsewhere. This section covers the tools provided by Unix for archiving your data so you don't have to worry about it on a day-to-day basis but can find things later when you need them.

tar: Hold the feathers

Usage: tar functions [ options ] [ arguments ] filenames

After going through all the effort of setting up your filesystem rationally, it seems like a waste to lose that structure in the process of storing it away, like hastily packed dishes in an unexpected cross-country move. Fortunately, there is a Unix command that lets you work with whole directories of files while retaining the directory structure. tar rolls a directory, its component files, and all its subdirectories into a single archive file, conventionally given the name of the directory plus a .tar extension. The command-line arguments of tar break down into two types: functions, of which you must choose exactly one, and options. tar is short for "tape archive," since the utility was originally designed to read and write archives stored on magnetic tape. Another common use of tar is to package software in a form that can be easily transferred over the Internet.

To run tar, you must choose one of the following functions:

c

Creates a new tape archive

r

Appends the files to an existing archive

u

Adds files to the archive if they aren't present or are modified

x

Extracts files from an existing archive

t

Prints a table of contents of the archive

The options for tar are as follows:

f archive

Performs the specified operation on archive, which can either be a device (such as a tape drive or a removable disk) or a tar file

v

(verbose mode) Prints the name of each file archived or extracted with a character to indicate the function (a for archived; x for extracted)

w

(whiny mode) Asks for confirmation at every step

Note that neither functions nor options require the hyphen that usually precedes Unix command options.

If you type:

% tar cvf otter.tar otter/

the otter/ directory and all its subdirectories are rolled into a single file called otter.tar. It's good practice to use the v option, so you can see if something is going horribly wrong while the archive is being processed.

If, on the other hand, you want to make an archive of the otter/ directory on the tape drive nftape, you can type:

% tar cvf /dev/nftape otter/

A couple of warnings about tar are in order. First, before you use tar on your system, you should find out whether the GNU or the standard version is installed (for instance, GNU tar responds to tar --version). Several of the options mean different things in each version; the ones listed earlier are the same in both.

Second, the tar file you create will be as large as all the contents of the directory and subdirectories beneath it. This condition has dire implications if your archived directory is large and you have limited disk space, or you need to transfer large amounts of tar 'd data. In these cases, you should break down the directory into subdirectories of a more manageable size, and tar those instead.

If you don't have enough space on your current filesystem or partition for your files and the archive you are creating to exist simultaneously, or you don't want to download and unpack a whole archive just to retrieve a few files, you can transfer your archive over the network (or even just to another partition) using a combination of ftp and tar commands. Sending an archive this way and then extracting it at the destination can be less time-consuming than a cp -r if a large number of files is involved. The ftp program recognizes a form in which a command replaces the input filename. The command is executed in a subshell on the local machine and operates on files on the local filesystem. The construct is:

ftp command "| command" filename

Inside the ftp program, here's how to send the output of the tar command, enclosed in quotes, into the filename specified as the target on the remote machine:

put "|tar cvBf - *" filename

Here's how to direct the downloaded archive through the tar command, resulting in extraction of only the files in the specified directory within the archive:

get filename.tar "|tar xvf - dirname"

Finally, here's how to list the contents of the remote archive:

get filename.tar "|tar t - *"

compress

Usage: compress -[ options ] filenames

Ultimately, you don't want to be left with large, if more manageable, tar files cluttering up your filesystem. In this situation, data-compression utilities are important, since they allow you to cheat and reduce the amount of space that files take up on your hard disk. compress is the standard Unix file-compression command. It's the opposite of uncompress, the command used in Chapter 3 to open compressed papers and software. compress adds a .Z to the end of the filename.

Here are the most useful options for compress:

-f

Forces compression, overwriting any existing compressed version of the file

-v

(verbose mode) Prints percentage compression achieved by the file

-r

(recursive mode) If compress is applied to a directory that contains subdirectories, compresses their contents as well as those of the original directory

If you have a text file named stoat.txt and the tar file of the otter/ directory from the last section, and you want to compress both and look at the resulting compression ratio achieved, type:

% compress -v stoat.txt otter.tar 

This command produces two files, stoat.txt.Z and otter.tar.Z. The files can be uncompressed using the uncompress command or gzip -d (described next). In case you were wondering, natural languages (the kind humans use) end up with a compression ratio of around 60%, and programming languages around 40%. Try compressing the sequences of some of your favorite proteins to see what sort of ratio you get: the values can be wildly variable, depending on whether there are repeats in the sequence.

gzip

Usage: gzip -[ options ] filenames

As usual, in addition to the standard Unix compress, there's a faster and more efficient GNU utility: gzip. gzip behaves in much the same way as compress, except that it gets better compression on average, since it uses a superior algorithm. gzip adds the suffix .gz to a file that it compresses. It emulates the compress options described earlier and adds a few of its own (a sample pipeline combining tar and gzip follows the list):

-N

(default setting) Preserves the original name and timestamp from the file being compressed

-q

(quiet mode) Suppresses warnings when running

-d

Returns a file that has been compressed by gzip to its uncompressed state; gzip can also recognize and uncompress files produced by compress
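
Since tar can write its archive to standard output (f -), a common trick, sketched here, is to combine tar and gzip in a pipeline so the uncompressed archive never has to exist on disk:

% tar cvf - otter/ | gzip > otter.tar.gz

To reverse the process, decompress to standard output and unpack:

% gzip -dc otter.tar.gz | tar xvf -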



[*] GNU tools are distributed and maintained by the GNU Project at the Free Software Foundation. GNU stands for "GNU's Not Unix" and refers to a complete, Unix-like operating system that's built and maintained by the GNU Project (http://www.gnu.org).

[†] This isn't an imaginary format at all. It's pretty close to the format of the output file from a calculation that we do frequently: computing the pKa values of individual amino acids in a protein.

[‡] See the Bibliography for pointers to complete references on vi.

[§] Giving rise to the old joke, "Why do programmers confuse Christmas and Halloween? Because OCT 31 is DEC 25."

[‖] If you need to transform data in a way that isn't allowed by the standard Unix filters, see Chapter 12, in which we discuss the Perl scripting language. Perl is a very complete and sophisticated language that allows you to produce an infinite variety of specialized filters.

[#] If you are logged in as root, there are certain tasks you can't do from a remote terminal.

[**] Discussion of the other signals can be found in any of the comprehensive Unix references listed in the Bibliography.
