Unix has a wealth of functions, and you'll want to be aware of a particular subset of them before you start running programs and collecting data. In Chapter 4, we talked about how to organize and manage your files in Unix, as well as how to move around the filesystem. In this chapter we take you on a whirlwind tour through the common Unix commands you'll need to know to work efficiently. We discuss the Unix shell itself, issuing commands in Unix, viewing, editing, and extracting information from your files, shell scripts, and working in a multiuser environment.
Once you've learned to use some of these Unix commands, you'll find that they are astonishingly powerful and flexible, allowing you to modify files in ways that are impossible, or at least not easy, with a conventional word-processing program. For example, with a single command you can find all the instances of a pattern in every file under your home directory. A few simple tricks can create a script that will process every file in your source data directory identically. Another simple script can update a customized local copy of a database every night while you're sleeping.
When you log into a Unix system or open a new window in your system's window manager interface, the system automatically starts a program called a shell for you. The shell program interprets the commands you enter and provides you with a working environment and an interface to the operating system. It's possible to work in Unix without the shell using graphical file manager tools, but you'll find that many shell commands are useful for data processing and analysis. Entire books devoted to the various shells are available, and the manpages for some of the common shells exceed 100 pages when printed. We provide you only with a brief introduction to the commonly used shells, to get you started with as few hurdles as possible.
The shell program you use affects the feel of your command-line interface. Some of the features that can be built into the shell program include a simple arithmetic interpreter that lets you use the command line as a calculator; command aliasing, which lets you refer to standard Unix commands with other more convenient words; filename completion, which lets you type only the number of characters necessary to distinguish a file from other files in the directory, rather than typing the full filename; command editing and command history, which let you scroll back through the commands you've recently issued and edit them on the command line; spelling correction; and help functions for the shell program.
There are a number of common shell programs on Unix systems. You are automatically assigned a shell when your system administrator sets up your account. On Linux systems, the default shell program is the bash (Bourne Again) shell. However, you may prefer to use a shell other than bash. The two main classes of shell programs are shells derived from the Bourne shell, sh, and shells derived from the C shell, csh. Bourne-type shells include sh, bash, ksh (the Korn shell), and zsh (the Z shell). C-type shells include csh and tcsh.
We tend to prefer C shells, for historical reasons. When we started working in Unix, the C shell was the best thing going, and the tcsh program has expanded the original csh into a powerful shell. tcsh implements most of the desirable shell features, including history, command aliasing, filename completion, command-line editing, arithmetic and functions, job control, and spelling correction. tcsh is also one of the most user-configurable shells. Therefore, we'll discuss the behavior of Unix commands from a C-shell perspective, as if you were using the tcsh program, which we use on our machines.
Your default shell will be listed as the last item in your entry in the /etc/passwd file. If you aren't certain which shell you are currently using, you can find out by typing:
% finger your-user-name
For user jambeck, this command shows the following information:
Login name: jambeck                In real life: Per Jambeck
Directory: /home/jambeck           Shell: /bin/tcsh
This tells us that he is using tcsh as his default shell. For practical reasons, we will limit our discussions and most references to csh and tcsh. It must also be noted that many system processes (e.g., batch, at, and cron) use the Bourne shell by default, which makes it necessary to learn at least a minimal subset of its command language. On most systems there are commands to change your default shell as set in the passwd file. The chsh (change shell) command allows you to change your default login shell, if you're working on a Linux system.
There's a standard format for sending an instruction to Unix. In this book, we'll refer to commands and to the command line. Each of Unix's many native commands has a tangible existence as an executable program, and to issue the command is to tell Unix to execute that program. In this section and those that follow, we move fairly quickly through concepts and commands. While we can give you a brief overview of the Unix features we find most useful, this book isn't designed to replace a comprehensive Unix reference book. If you're new to Unix, we strongly recommend that you review the basics of Unix with the help of books such as Learning the Unix Operating System, Running Linux, or Unix for the Impatient. We've provided a list of recommended reading in the Bibliography.
The command line consists of the command itself, optional arguments that modify how the command works, and operands such as files upon which the command operates. For example, the chsh (change shell) command, which we just discussed briefly, has several possible options. The first is the -s option, which must be followed by the name of a shell program as its argument. The second is the -l option, which needs no argument, and which lists the shells that are available on your system. The operand for the chsh command is the username of the user whose shell is being changed. So, to change your default shell program, you might first type:
% chsh -l
which gives you a list of the shell programs available on the system:
/bin/bash
/bin/sh
/bin/ash
/bin/bsh
/bin/bash2
/bin/tcsh
/bin/csh
/bin/ksh
/bin/zsh
Then, to actually change your shell to tcsh, you can type:
% chsh -s /bin/tcsh yourusername
Options can simply be single-letter codes, or they can have their own arguments. Options that take no arguments can be given as a group, while each option that takes an argument must be specified separately. Each option group and separate option must be preceded by a hyphen (-). The last option in a group, or separate options, can be followed by the option argument. The operands follow the final option in the list.
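As a small illustration of these rules, here is a sketch using sort, whose -r (reverse) and -n (numeric) options take no arguments and whose -k option takes a field number. It is written in Bourne-shell syntax rather than tcsh, and the sample data and variable names are invented for the example.

```shell
# Sample data: three numbers, one per line (invented for this example).
numbers=$(printf '10\n9\n100\n')

# Two argument-less options given separately...
separate=$(printf '%s\n' "$numbers" | sort -r -n)

# ...and the same two options given as a single group.
grouped=$(printf '%s\n' "$numbers" | sort -rn)

# An option that takes an argument (-k needs a field number) can be
# the last option in a group, followed by its argument.
by_field=$(printf 'b 2\na 1\nc 3\n' | sort -nk 2)

printf '%s\n' "$separate"
printf '%s\n' "$by_field"
```

Both spellings of the first command should produce the same reverse numeric ordering; which one you type is a matter of taste.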
Many Unix commands have options that, frankly, you'll never use. And we're not going to talk about them. But there are ways of finding out more.
Unix has its own built-in reference manual, which is quite comprehensive and informative, and which will give you the correct information about the commands and options available on the particular system you're using.
The man command is one of the most useful Unix commands; it allows you to view Unix manual pages. While some Unix systems have implemented a web browser-like interface to the Unix manpages, you can't always count on this option being available. The man command is available on all types of Unix systems.
Usage: man name
where name can be a Unix command, such as grep, or a system file, such as the password file /etc/passwd.
If you're not sure of the command you're looking for, you can sometimes find the right information using man's slightly smarter cousin, apropos. The apropos command locates commands by keyword lookup.
Usage: apropos name
For instance, if you're concerned about disk usage on your system, you can enter apropos usage. The output of this command on our PC running Red Hat Linux is:
du (1) - summarize disk usage
getrlimit, getrusage, setrlimit (2) - get/set resource limits and usage
quota (1) - display disk usage and limits
quotacheck (8) - scan a file system for disk usages
apropos doesn't always produce such brief and informative output. Entering a smart combination of keywords is (as always with such searches) the key to getting the output you want. If you want a predictable listing of Unix commands, it's probably best to pick up a comprehensive Unix book.
What should you do if you find the following text in a manpage?
This documentation is no longer being maintained and may be inaccurate or incomplete. The Texinfo documentation is now the authoritative source.
The GNU[*] set of Unix tools is adopting a documentation system, called texinfo, that is different from the traditional man system. If you come across this message, you should be able to read the up-to-date documentation on the program by typing in the command info progname. For instance, info info gives you a complete set of documentation on the use of info and even provides instructions for creating your own info documentation when you start writing your own programs.
By default, many Unix commands read from standard input and send their output to standard output. Standard input and output are file descriptors associated with your terminal. A program reading from standard input will simply hang out and wait for you to type something on your keyboard and press the Enter key. A program writing to standard output spews its output to your terminal, sometimes far faster than you can read it.
Some Unix commands read a hyphen (-) surrounded by whitespace on either side as "data from standard input." This construct can then be used in place of a filename in the command line. Absence of an output filename is sufficient to cause the program to write to standard output.
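To make the hyphen convention concrete, here is a minimal sketch in Bourne-shell syntax. The file name header.txt and the text fed in on standard input are made up for the example; cat is one of the commands that treats a lone hyphen this way.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf 'first line\n' > "$workdir/header.txt"

# cat reads header.txt first, then "-" (standard input, which is
# supplied here by printf through a pipe).
combined=$(printf 'typed line\n' | cat "$workdir/header.txt" -)

printf '%s\n' "$combined"
rm -rf "$workdir"
```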
The standard input and output descriptors are useful because you can redirect both standard input and output, associating them with filenames, with no effects on the functioning of the program. Here are the most common redirection constructs used by the C shell:
<
This redirector preceding a filename associates that filename with standard input, i.e., the contents of the file are presented to the program as if they are standard input.
>
This redirector associates a filename with standard output, so that the filename is created on execution of the command, or whatever is in an existing file of that name is overwritten by the output of the command.
>>
This redirector associates a filename with standard output. It differs from > in that the output of the command is appended to the end of the existing file.
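The three redirectors can be tried out safely in a scratch directory. The following is a sketch in Bourne-shell syntax (these three constructs are spelled the same way in the C shell); the file name results.txt is invented for the example.

```shell
workdir=$(mktemp -d)

# ">" creates results.txt, or overwrites it if it already exists.
printf 'run 1\n' > "$workdir/results.txt"
printf 'run 2\n' > "$workdir/results.txt"   # replaces "run 1"

# ">>" appends to the end of the existing file instead.
printf 'run 3\n' >> "$workdir/results.txt"

# "<" presents the file to wc as standard input.
line_count=$(wc -l < "$workdir/results.txt")
contents=$(cat "$workdir/results.txt")

printf '%s\n' "$contents"
rm -rf "$workdir"
```

After these commands the file holds two lines, because the second > discarded the first run before >> added the third.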
The cat command reads the contents of a file and writes them to standard output. If you want to use the cat command to combine the contents of three files into one new file, you can use a redirector like this:
% cat file1 file2 file3 > file4
This construct with cat would be useful if, for example, you'd just downloaded a bunch of individual sequence files from the NCBI web site and want to collect them into one large file that can be read by another program. (This is an example of something that seems like it should be simple, but is actually time-consuming and annoying to do with a standard PC word-processing program. Unix provides a neat solution that doesn't even require you to open any files.)
You can also use redirectors to direct the contents of a file into a program at run-time, as standard input (useful if you are running a program that prompts you for input from the keyboard) or to capture output from a program that is normally written to standard output:
program < inputfile
program > outputfile
For example, let's say you've just finished an extensive BLAST search, and you want to send the results to your colleague. You can use the redirector < ("less than") to scoop the file huge_blast_report out of your directory and mail it directly to your colleague:
% mail [email protected] < huge_blast_report
If you want to increase the chances of your colleague opening the message, you can add a subject header to the mail message using the mail option -s. The command reads:
% mail -s "surprise!" [email protected] < huge_blast_report
The reverse operation, sending the results of standard output (or text that's displayed on your screen) to a file, can be accomplished using > ("greater than"). Perhaps your colleague wants to write a quick reminder to herself to reply to your mail. She could do it using the cat command to take input from the keyboard and redirect it to a file, like this:
% cat > reminder_to_self
Ha! Send fifteen BLAST reports to colleague on Monday.
^D
%
Ctrl-D (^D) signals that you have finished entering text. Your colleague now has a file called reminder_to_self in her current working directory.
Operators are similar to redirectors in that they are ways of directing standard input and output. However, they direct input and output to and from other commands rather than to filenames.
The most commonly used operator is the pipe (|). The pipe directs standard output of one command into standard input for the next command. This allows you to chain together several different filtering commands or programs without creating input or output files each time.
You can use the cat command to direct the contents of a file into a program that reads information from standard input:
% cat inputfile | program
This command construct does the same thing as the example we showed earlier (program < inputfile). Both present the contents of inputfile to program as standard input. If you want to do a lot of runs of the same program using slightly different input, you can create multiple input files and then write a script that cats each of those input files in turn and pipes their contents to program.
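Such a script might look something like the sketch below. It is written in Bourne-shell syntax (the C shell spells its loop with foreach instead), the input file names are invented, and sort merely stands in for whatever program you actually want to run.

```shell
workdir=$(mktemp -d)
printf 'gamma\nalpha\n' > "$workdir/input1.txt"
printf 'delta\nbeta\n'  > "$workdir/input2.txt"

# "sort" is a placeholder for your own program; any command that
# reads standard input could take its place.
for inputfile in "$workdir"/input*.txt; do
    # Name each output file after the input file it came from.
    cat "$inputfile" | sort > "${inputfile%.txt}.out"
done

first_output=$(cat "$workdir/input1.out")
second_output=$(cat "$workdir/input2.out")
printf '%s\n' "$first_output" "$second_output"
rm -rf "$workdir"
```

Naming each output after its input is one simple way to keep track of which run produced which result.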
Pipes can carry out a complete set of file-processing options without writing to disk. For instance, imagine that you have a datafile consisting of multiple tables concatenated together. The first table in the file takes up the first 67 lines, the second table takes up the next 100 lines, and the rest of the file is taken up by a third table.[†] You want the information that's contained in the second column of the middle table, which stretches from characters 30-39 in the row. Using filters and pipes, you can construct the following command to crop out the data you need:
% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data
In this example, head sends the top 167 lines of a specified file or files (in this case protein1.pka) to standard output; tail takes the last 100 lines of the output of head; and cut takes the correct column of characters out of the results of head and tail and then stores it in protein1.pka.data.
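If you'd like to convince yourself that the pipeline selects the right rows, you can try it on a generated file whose contents you know in advance. The sketch below, in Bourne-shell syntax, builds a hypothetical 200-line file (sample.pka, standing in for protein1.pka) in which characters 30-39 of each line hold the line number, and cross-checks the result against sed. It uses head -n 167, which is equivalent to the head -167 form shown above.

```shell
workdir=$(mktemp -d)
datafile="$workdir/sample.pka"   # hypothetical stand-in for protein1.pka

# Build 200 fixed-width lines: a 29-character label, then a 10-digit
# line number occupying characters 30-39.
awk 'BEGIN { for (i = 1; i <= 200; i++) printf "%-29s%010d end\n", "row", i }' > "$datafile"

# Lines 68-167 (the "middle table"), characters 30-39 of each line.
middle=$(head -n 167 "$datafile" | tail -n 100 | cut -c30-39)

# The same line range selected with sed, as a cross-check.
middle_sed=$(sed -n '68,167p' "$datafile" | cut -c30-39)

first_value=$(printf '%s\n' "$middle" | head -n 1)
last_value=$(printf '%s\n' "$middle" | tail -n 1)
printf '%s ... %s\n' "$first_value" "$last_value"
rm -rf "$workdir"
```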
A useful construct Unix shells recognize is the presence of wildcard characters in filenames. The shell locates matches for any wildcards before passing filenames on to the program. The two most commonly used wildcards are the asterisk (*) and the question mark (?). * means "any sequence of zero or more characters, except for the / character." ? means "any single character." Thus, "every file in this directory" can be denoted by a lone *, which is a useful shortcut.
The shell recognizes other wildcards as well. The construct [cset] refers to any characters in the specified set. If you want to move all files beginning with letters a through m to a new directory, you can structure the command as mv [a-m]* ../newdir. If you want to move all files beginning with a number to a new directory, enter mv [0-9]* ../newdir.
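Because the shell expands wildcards before the command ever runs, you can preview what a pattern matches by handing it to echo. The following sketch, in Bourne-shell syntax, does this in a scratch directory; the file names are invented, and the collation order is pinned with LC_ALL=C so the results are reproducible.

```shell
# Fix the collation order so the expansions below are reproducible.
LC_ALL=C; export LC_ALL

workdir=$(mktemp -d)
startdir=$(pwd)
cd "$workdir" || exit 1
mkdir newdir
touch apple.txt mango.txt zebra.txt 7up.txt data1 data22

# echo simply prints whatever filenames the shell hands it.
a_through_m=$(echo [a-m]*)
starts_with_digit=$(echo [0-9]*)
data_plus_one=$(echo data?)

# Move the files that start with a number into newdir.
mv [0-9]* newdir/
moved=$(ls newdir)

printf '%s\n' "$a_through_m" "$starts_with_digit" "$data_plus_one"
cd "$startdir" || exit 1
rm -rf "$workdir"
```

Note that data? matches data1 but not data22, since ? stands for exactly one character.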
On Unix systems running the X Window System, there are many commands available that initiate programs with functions that aren't command line-based. Once these programs, which can include anything from graphics viewers to complicated scientific applications, are called from the command line, they use the X Window System to open their own windows, which generally contain a complete, independent graphical user interface.
You're probably accustomed to the idea of using a program to open a file. If your first introduction to computers has been sometime in the last 15 years, you're probably used to simply clicking on a file icon, which is automatically recognized by the right piece of software, which opens the file.
In Unix, commands are designed to operate on files that are sensibly readable and printable as text whenever possible. Thus text files can be opened by a wide variety of commands that allow a great deal of flexibility in file manipulation. The file reading and processing commands have such functions as sorting data based on the value of a particular substring in each line of the file, cutting a particular column out of a file, pasting columns of data together side by side, checking to see what the differences between two files are, and searching for instances of a pattern in a file or group of files. Often, these simple commands are all you need to extract a desired subset of the data in a file and prepare it for analysis.
Unix has many ways to view and edit the contents of files. There are viewers for text and programs that allow you to examine the contents of binary files, as well as full-featured editors for modifying plain-text files.
Usage: cat -[options] files
cat dumps the contents of a file onto the screen. If your file is short, or if you've successfully completed a speed-reading course, this utility works well. If you need to see what's on each page of a file, though, cat is less useful, since the contents of the file scroll by without pausing.
Instead of viewing text, cat is most useful for combining (or concatenating) files. For instance, if you have a series of files of program output named meercat1.txt, meercat2.txt, and meercat3.txt, and you want to combine them into a single file, you can type:
% cat meercat1.txt meercat2.txt meercat3.txt > big-meercat.txt
This command writes the contents of meercat1.txt, followed by the contents of meercat2.txt and then meercat3.txt, into one big file named big-meercat.txt. If you've thought to number the outputs sequentially (as we have with the meercats), and want them in that order in the file, you can just type:
% cat meercat*.txt > big-meercat.txt
and it will have the same effect. Wildcard characters such as * use a strict alphabetical order: if they exist, files meercat10.txt and meercat11.txt come before meercat2.txt.
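The ordering is easy to check for yourself. The sketch below (Bourne-shell syntax, scratch directory, collation pinned with LC_ALL=C) shows the alphabetical expansion and one common workaround, which is our suggestion rather than anything the shell requires: pad the numbers with leading zeros so that alphabetical and numerical order agree.

```shell
# Fix the collation order so the expansions below are reproducible.
LC_ALL=C; export LC_ALL

workdir=$(mktemp -d)
startdir=$(pwd)
cd "$workdir" || exit 1

# Unpadded numbers: 10 and 11 sort ahead of 2.
touch meercat1.txt meercat2.txt meercat10.txt meercat11.txt
unpadded=$(echo meercat*.txt)

# Zero-padded numbers of equal width sort the way you would expect.
rm meercat*.txt
touch meercat01.txt meercat02.txt meercat10.txt meercat11.txt
padded=$(echo meercat*.txt)

printf '%s\n%s\n' "$unpadded" "$padded"
cd "$startdir" || exit 1
rm -rf "$workdir"
```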
cat can also append files to the end of an existing file. For example, if your program generates another output file you need to attach to the end of the collection, the command:
% cat meercat10.txt >> big-meercat.txt
does just that. If you use > instead of >> in this situation, instead of being added at the end of the file, the new file meercat10.txt overwrites the entire contents of big-meercat.txt.
Incidentally, if you want a command that's the reverse of cat to print the lines of a file in backward order, you're in luck: the command is called tac. Sadly, the command acta, for printing a file inside out, hasn't yet been implemented.
Usage: more -[options] [+linenumber] [+/pattern] filename
more is a pager, which in Unix means a program that lets you view a file one page at a time. Suppose you have a file containing BLAST output named blast-first.txt. Typing:
% more blast-first.txt
shows you the first page of the file blast-first.txt, and steps forward one page every time you press the space bar. To leave more, hit the q key; to view other more commands while within more, enter h.
more is smart about moving around files. If you know where you want to go in the file, you can specify the line number (using the +linenumber option). If, on the other hand, you want to start at the first occurrence of a certain word or pattern, use the +/pattern option. When viewing a file in more, if you press the / key and then type a pattern to search for, more jumps to the next occurrence of that pattern in the file and repeats searches for each subsequent occurrence of that pattern every time you press / followed by the Enter key.
Here are some other useful options for more:
Shows normally unprintable control characters as well as normal text
Squeezes multiple empty lines into a single one
You can pipe the output of a program that generates more than a screen's worth of text to more, allowing you to page through the output one screen at a time. Let's say you want to know who is logged into your Unix system. If enough users are logged in, the output scrolls off the screen. By piping who to more:
% who | more
you can scroll through the output line-by-line using the Return key or screen-by-screen using the space bar.
more's most significant shortcoming is that some versions can't move backward through a file. less is a utility that remedies this simple problem.
There is a superior pager command, less. Most importantly, less rectifies more's biggest flaw: it lets you page backward as well as forward in a file. less also doesn't load a file into memory all at once, which makes it less likely that your computer will grind to a halt if you view a huge file with it. Finally, it also handles binary files more gracefully, displaying readable text as characters and representing unreadable control characters in the form ^X.
less uses the same options as more, but it also takes additional options. Be sure to check info less to see which ones your local version takes. And finally, while it hardly bears mentioning, why is it called less? Because less is more. Sigh.
Usage: vim filename
Because it's a text-based operating system that has historically been used for software development and computation, Unix did not traditionally provide the kind of full-featured, "what you see is what you get" text editing that exists on personal computers, although now such editors are available. In fact, WYSIWYG text editors are of limited utility for programmers because they often introduce invisible markup characters into documents.
It's worth learning to use the plain-text editors that are provided for Unix. They have a fairly steep learning curve, but they are the right tools for the job if you're writing programs or looking at plain-text data. If you download sequence data from a web server and open and work with it in a plain-text editor, the file you write out should be readable by a sequence-analysis program. If you opened the same file and worked with it in a WYSIWYG editor, then wrote it out in the file format used by that editor, it would be unreadable by other programs.
The vi editor is a standard feature of most Unix systems. It's a full-screen editor; it allows you to see as many lines of the file that you are editing as will fit into the terminal screen or window in which you run it. The cursor can be moved through the file using keyed instructions, but it can't be moved with the mouse. The bottom line on the screen is called the status line. Error messages from vi appear in the status line.
In Section 5.6, we discuss the use of regular expressions for searching and replacement as a feature of the plain-text editor vi. The ability to use vi with the regular-expression language makes vi a powerful tool for file manipulation.
A few nice features have been added to vi in vim (vi improved). It's worth asking your system administrator to install vim if it's not already on your system, if only for the multiple undo feature that it introduces. We can't cover all the features of vim here, but we will present a few commands that will get you up and running.[‡]
vim has three modes; in each, input from the keyboard is interpreted differently:
Command mode
This is the main mode; you are automatically in command mode when you start working. Keystrokes are interpreted as vim's short commands, most of which consist of one or two letters. You can always return to the command mode by hitting the Escape key once (or sometimes twice).
Insert mode
This mode is reached by issuing any command that requires input.
Status line mode
This mode is for issuing longer, more complex commands. To reach status line mode, simply type a colon (:) in command mode. A colon appears at the left side of the status line, and anything you type appears in the status line. When you finish typing your command and hit the Enter key, the command is executed, and you return to command mode.
Here are some of the most useful vim command-mode commands:
h, j, k, l
Moves the cursor around in your file character-by-character or line-by-line. It's sort of like a pre-joystick video game: "h" moves you to the left, "l" to the right, "j" moves you down a line, and "k" up a line. On most systems, the arrow keys on your keyboard will also work to move you around within vim.
w, b
Moves the cursor forward ("w") or back ("b") by one word in the text. Words are delimited by whitespace.
), (
Moves the cursor forward ")" or back "(" by one sentence in the text. Sentences are recognized as sequences of words terminated by an end-of-sentence character (. ? !).
a, A, i, I, o, O
Initiates the insertion of text. "a" and "A" insert text after the cursor and at the end of the current line, respectively. "i" and "I" insert text before the cursor and at the beginning of the current line. "o" and "O" open a blank line below or above the current line, respectively, and begin inserting text on the new line.
x, X
Deletes the text under the cursor or before the cursor, respectively. Preceded by an integer number, they delete that number of characters after or preceding the cursor.
s, S
Substitutes for the character under the cursor or for the current line, respectively, by deleting the character either under the cursor or the line and initiating insertion of text in place of the deleted character. Preceded by an integer number, "s" replaces that number of characters with the new text, and "S" replaces the specified number of lines.
Here are some of the most useful vim status line mode commands:
:wq
Saves changes to the file and quits the editing session. ":w" can be used by itself or with the name of the file to write to. ":q!" exits the session without saving changes.
:r
Followed by a filename, inserts the entire text of the named file.
:%s/pattern/replacement/g
Searches for and replaces pattern with replacement throughout the buffer. If the trailing "g" is left off, only the first occurrence of the pattern in any line is replaced.
:number
Moves the cursor to the specified line number.
vim is a fairly flexible editor, and you can certainly learn to make it do any text-editing task that you need to do. However, there are other options for text editing on Unix systems. The best of these is probably the Emacs editor. Emacs is an editing program made available by the Free Software Foundation. It contains not only a text-editing facility with special modes for TeX and LaTeX documents, programs in various programming languages, and outlines, but also a file manager, mail and news readers, and access to the online documentation browser info. Whole books have been written on Emacs (see the Bibliography) so we won't go into it here except to recommend that, if you're working on a Unix system, learning to use Emacs is one of the better uses of your learning-curve time.
Usage: strings -[options] filenames
In addition to the text files we've discussed up to now, there are also binary files that can't be read as text. They are almost always the output of a program or the executable form of a program itself (as opposed to the source code). Binary files and program executables aren't human-readable because they are in machine language. Because of this language gap, we'll unflinchingly make the prediction that, 9 times out of 10, it isn't worth the effort needed to read binaries. You'll have more luck taking another route, like talking to the person whose program created the file in the first place. Unfortunately, many programs today, such as commercial hidden Markov model software or data mining programs that directly write their internal representation of data structures to disk, use binary files to store proprietary data structures. For that tenth time, then, we present some tips on how to extract information from binaries without going crazy.
Your first step should be to use either the less command described earlier or the strings command. If any portions of the file are in plain text, they will be readable in less. The strings command cuts out any readable text characters in the file and prints them to the screen. For example, if you have an undocumented binary file named badger and want to see if it contains any clues as to what it does, try typing:
% strings -n 3 badger | less
(The -n option tells strings how many readable text characters in a row constitute a string. The default setting for -n is four.) Piping the output to less will let you page through it if it's longer than one screen. If the output looks like:
ATCGTACTGATCGTCGATCGTCGATCATGCA
CGTAGCAGTCGATCATCATCGTACTAGCTAG
ATGCCTGAGCTATACACACTAGTCACGATGC
you might guess that badger contains some kind of binary encoding of data including a nucleotide sequence or a (not good) multiple sequence alignment.
Usage: od -[options] filenames
Sometimes, it may be necessary to do more than just identify a binary file. In these cases, the od program may provide a first step in understanding the file's contents. Before looking at od itself, let's take a quick detour through the ways in which binary information is represented in a moderately more human-friendly form.
Rather than using conventional decimal (base-10) notation, binary data is usually represented using a base that is a power of two: either octal (base-8) or hexadecimal (base-16) digits. Octal numbers are usually preceded by a 0. For example, the decimal number 25 corresponds to octal 031.[§] Hexadecimal digits, on the other hand, are usually preceded by a 0x and use the letters A through F to represent the decimal numbers 10 through 15. The decimal number 25 is 0x19 in hexadecimal.
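You can check conversions like these from the command line. The sketch below uses the standard printf utility, written in Bourne-shell syntax; the variable names are ours.

```shell
# Decimal 25 written in octal and in hexadecimal.
octal=$(printf '%o' 25)
hex=$(printf '%x' 25)

# Going the other way: printf reads a numeric argument with a leading 0
# as octal and one with a leading 0x as hexadecimal.
from_octal=$(printf '%d' 031)
from_hex=$(printf '%d' 0x19)

printf 'octal 0%s, hex 0x%s, back to decimal: %s and %s\n' \
    "$octal" "$hex" "$from_octal" "$from_hex"
```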
If you want to delve into the heart of the binary file and see what's going on, you can use the od command to perform an octal dump (or hex dump) and see if your binary file is readily interpretable. Typing:
% od -c badger | less
creates an octal dump of badger you can step through a page at a time. It should look something like this:
0000000 001 006 R T D C Y G
0000020 006 R T D C Y G a
0000040 001 1 003 A R G 001
0000060 002 C A 002 B Z 270 R ? 200 @
od's primary options are:
-c
Prints out text characters corresponding to bytes
-x
Produces a hex dump of the file
-o
Produces an octal dump (the default setting)
-d
Produces a dump of unsigned decimal numbers
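To get a feel for od's output, it helps to dump a few bytes whose values you already know. The sketch below, in Bourne-shell syntax, feeds four known bytes to od. It makes two choices of our own that aren't covered above: -An suppresses the offset column, and -t x1 prints one byte at a time in hex, which sidesteps the byte-order differences between machines that can affect -x.

```shell
# Three printable bytes (A, C, G) followed by a newline.
char_dump=$(printf 'ACG\n' | od -An -c)
hex_dump=$(printf 'ACG\n' | od -An -t x1)

# Character view: letters appear as themselves, the newline as \n.
printf '%s\n' "$char_dump"

# Hex view: 41, 43, and 47 are the ASCII codes for A, C, and G;
# 0a is the newline.
printf '%s\n' "$hex_dump"
```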
Unless you're a serious programmer, you're not likely to have to read binaries. However, on the off chance that you do, we hope these standard tools will help you start to get your questions answered.
Filters are programs that take input data and transform it to produce output. They can accomplish tasks, such as extracting parts of files, that word processing and spreadsheet applications can't. A transformation may involve a simple manipulation of the data format, or selection of specified lines or fields from the data. In this section, we discuss some of the more commonly used filters that are part of Unix. These filters can read from standard input and write to standard output, allowing you to combine them and produce fairly complex transformations.[‖]
Usage: head
-
number
files
|
Say you have a program that spits out a lengthy datafile that has
several different tables of information concatenated together.
Leaving aside the question of why anyone would write a program that
creates such difficult output, there are commands that allow you to
work with such data, and you need to know them.
head
is one such command.
By default, head
sends the top 10 lines of the specified
file or files to standard output. Checking the head of a file this
way is an easy way to see if there's something in the file
without opening it using an editor or doing a full
cat
of the file.
With the -number
flag, head
becomes a tool for selecting a specified number of records from the
top of a file. Combinations of head
and
tail
commands can extract any set of lines from
a file provided that you know their location in the file.
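To make the head and tail combination concrete, here is a minimal sketch that pulls lines 11 through 15 out of a 20-line file. The file name table.txt and its contents are invented for the example, and the -n number spelling is used because current versions of head and tail prefer it to the bare -number form.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 20-line sample file: "record 1" through "record 20".
awk 'BEGIN { for (i = 1; i <= 20; i++) print "record " i }' > table.txt

# Keep the first 15 lines, then keep the last 5 of those: lines 11 through 15.
head -n 15 table.txt | tail -n 5 > lines11-15.txt
cat lines11-15.txt
```

The general recipe is head -n END piped into tail -n (END - START + 1), which is why you need to know where the lines of interest sit in the file.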
Usage: tail [-f] -number files
The tail
command outputs the last 10 lines of a
file by default, or the last num lines of the
file if specified. With the -f
option,
tail
gives constantly updated output of the last
few lines in the file. This provides a way to monitor what is being
added to a text output file as it's written by another program.
Usage: split -[options] filename
Usage: csplit -[options] file criteria
The
split
command allows you to break up an existing
file into smaller files of a specified size. Each file produced is
uniquely named with a suffix (aa, ab ... az, ba, etc.). The options to
split
are:
-l length    Splits the file into subfiles of length lines
-a length    Uses length letters to form suffixes
If you have a file called big-meercat.txt
and
you want to split it into subfiles of length 100 lines using
single-letter suffixes and writing the files out to subfiles named
meercat.*
, the command form of the
split
command is:
% split -l 100 -a 1 big-meercat.txt meercat.
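The following sketch runs that split command on a stand-in file so you can see what it leaves behind. The 250-line content is generated on the spot and is purely illustrative.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 250-line stand-in for big-meercat.txt.
awk 'BEGIN { for (i = 1; i <= 250; i++) print "line " i }' > big-meercat.txt

# 100 lines per piece, one-letter suffixes: meercat.a, meercat.b, meercat.c
split -l 100 -a 1 big-meercat.txt meercat.
ls meercat.*

# The last piece holds the 50 leftover lines.
wc -l meercat.a meercat.b meercat.c

# Concatenating the pieces in order reproduces the original file.
cat meercat.a meercat.b meercat.c | cmp - big-meercat.txt && echo "pieces match original"
```

Note that a single-letter suffix allows at most 26 pieces; for larger inputs a longer suffix is needed.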
csplit
also splits files into subfiles, but is
somewhat more flexible than split
, because it
allows the use of criteria other than number of lines or bytes for
splitting. Here are csplit's options:
-f prefix    Uses the specified file prefix to form subfile names
-n number    Uses suffixes of a specified length to form subfile names; subfile suffixes are made up of numbers rather than letters
Split criteria are formed in two ways: either a regular expression is supplied as the criterion, possibly modified by an offset, or a number of lines can be specified.
A biological sequence database in FASTA format may contain many records of the form:
>identifying header information
PROTEINORNUCLEICACIDSEQUENCEDATA
The csplit
command can split such a database into
individual sequence files using the command:
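% csplit -z -f dbrecord. -n 6 fastadbfile '/^>/' '{*}'
The pattern is quoted so that the shell doesn't treat > as a redirection. The '{*}' argument, a GNU csplit extension, repeats the split at every header line rather than only at the first one, and -z (also GNU) suppresses the empty piece that would otherwise precede the first header. The file is split into numbered subfiles, each containing a single sequence.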
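Here is a small, self-contained sketch of splitting a FASTA file at its header lines. It assumes GNU csplit, as found on Linux, because it relies on the '{*}' repeat count and the -z option; the three records in fastadbfile are invented for the example.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up three-record FASTA file.
printf '>seq1 first record\nMKVLA\n>seq2 second record\nGGSTT\nPLAV\n>seq3 third record\nACDE\n' > fastadbfile

# -s: don't print byte counts; -z: drop empty pieces (GNU);
# '/^>/' '{*}': start a new piece at every line beginning with > (GNU repeat).
csplit -s -z -f dbrecord. -n 6 fastadbfile '/^>/' '{*}'

# Show each piece with its header line.
for f in dbrecord.*; do
    echo "$f: $(head -n 1 "$f")"
done
```

Both the pattern and the repeat count are quoted so that the shell passes them to csplit untouched.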
Usage: cut -c list filenames
or cut -f list -d delim -s file
The cut
command outputs selected parts of each
line of an input file. A line in a file is simply any stretch of
characters that ends with a specific delimiter; a delimiter is a
special nontext character an operating system or program recognizes.
Lines in files are terminated with an
EOL
(end-of-line) character; files
themselves are terminated with an EOF
(end-of-file) character. These characters are usually invisible to
you when you're working with the file, but they are important
in how a file is treated by programs that read it.
For example, say you have a file called
sequence_data
that contains the following:
ATC TAC
ATG CCC
GAT TCC
Here's how to use cut
to output the first
character of each line in the file:
% cut -c 1 sequence_data
A
A
G
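And here's how to output fields 1 and 2 of each line, assuming the two codons on each line are separated by a single space (with only two fields per line, this prints each line in full):
% cut -d ' ' -f 1-2 sequence_data
ATC TAC
ATG CCC
GAT TCC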
Portions of each defined line can be selected by character number in
the line with the -c
option, or by field with
the -f
option. Fields are stretches of
characters within a line that are defined by delimiters. The most
obvious delimiter for use within the text of a file is simply the
space character, but other characters can be used as well. Fields are
different from columns, which are strictly defined by numbering each
character in the input line.
The list argument specifies the range of each line, whether in characters or in fields, to be selected.
The list is in the form of single numbers or of two numbers separated
by a -
character. Multiple single columns or
ranges can be selected by separating them with commas. Either the
first or the last number can be omitted, indicating that the cut
starts at the beginning of the line or that it ends at the end of the
line. Characters and fields in each line are numbered starting at 1.
When the -f
option is used, indicating that
cut
is to count fields rather than characters, a
delimiter other than the default tab character can be specified with
the -d
option. The -s
option causes cut
to ignore lines that
don't contain the specified delimiter. This option can be
useful, for example, for ignoring header lines in a table.
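The interplay of -f, -d, and -s is easier to see on a small table. The sketch below uses a made-up colon-delimited file, masses.txt, whose first line is a title with no delimiter in it; the animals and numbers are illustrative only.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical table: a title line, then name:species:mass records.
printf 'Mustelid body masses\nbadger:Meles meles:12.0\notter:Lutra lutra:8.5\nstoat:Mustela erminea:0.3\n' > masses.txt

# Fields 1 and 3 with colon as the delimiter; -s drops the title line,
# which contains no colon.
cut -d : -f 1,3 -s masses.txt > names_masses.txt
cat names_masses.txt

# Characters 1 through 3 of every line, title included.
cut -c 1-3 masses.txt
```

Without -s, the title line would be passed through unchanged, because cut prints lines that lack the delimiter in full.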
Usage: paste -[options] files
The
paste
command allows you to combine fields from
several files into one larger file. Unlike the
join
command, which does a database-style
merging of two files, paste
is a purely
mechanical combination of files. Lines are combined based solely on
their line number in each file: i.e., the first line of
file1 is pasted next to the first line of
file2, regardless of the content of the lines.
Pasted data is separated by a tab character unless another delimiter
is specified with the -d
option. With the
-s
option and only one input filename,
paste
joins all the lines in the input file into
one long line.
paste
can prepare datafiles to be read by
data-analysis applications. If you have a group of files in the same
format and you have used other filter commands to remove
corresponding information from each of them, you can prepare one
input file that allows you to plot the corresponding information from
each of the files without reading them independently. In a previous
example, we used piped commands to extract a column from a table in a
complicated output file:
% head -167 protein1.pka | tail -100 | cut -c30-39 > protein1.pka.data
If you have eight similar output files for proteins 1-8, you can process them all in the same way and then paste the results that you're interested in comparing into one big datafile:
% paste protein*.pka.data > allproteins.pka.data
Each individual file in this example might look something like this:
3.8
12.0
10.8
4.4
4.0
6.3
7.9
Each number represents the computed pKa value of one amino acid in a
protein. If you have several sets of results that can be meaningfully
combined into a table, paste
creates a simple
tab-delimited table that looks like this:
3.8     3.2     3.6
12.0    12.9    12.5
10.8    10.9    11.0
4.4     4.2     4.5
4.0     3.9     4.2
6.3     6.5     6.2
7.9     7.5     8.0
It's up to you, however, to understand how your data can be meaningfully combined into a table and to use the paste command correctly to get the result you want.
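A short sketch of paste in action, using the first three rows of the pKa values shown above. The three input files are created on the spot; in real use they would come from the head, tail, and cut pipeline.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One column of values per file (first three rows of the example table).
printf '3.8\n12.0\n10.8\n' > protein1.pka.data
printf '3.2\n12.9\n10.9\n' > protein2.pka.data
printf '3.6\n12.5\n11.0\n' > protein3.pka.data

# Default: corresponding lines are joined with tabs.
paste protein*.pka.data > allproteins.pka.data
cat allproteins.pka.data

# -d picks a different delimiter, here a comma.
paste -d , protein1.pka.data protein2.pka.data

# -s joins all the lines of one file into a single line.
oneline=$(paste -s -d ' ' protein1.pka.data)
echo "$oneline"
```

Because paste matches lines purely by position, the columns only line up meaningfully if every input file lists the same residues in the same order.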
Usage: join -[options] file1 file2
join
merges two files based on the contents
of a specified join field, where lines from the two files having the
same value in the join field are assumed to correspond. Files are
assumed to have a tabular format consisting of fields and a
consistent field separator, and are assumed to be sorted in
increasing order of their join fields.
Command-line options for join
include:
-1 field    Uses the specified field number as the join field in file 1
-2 field    Uses the specified field as the join field in file 2
-t char    Uses the specified character as the delimiter throughout the join operation
-e string    Replaces empty output fields with the specified string
-a filenum    Produces output for each unpairable line in the specified file; can be specified for both input files; fields belonging to the other file are empty
-v filenum    Produces output only for unpairable lines in the specified file
-o list    Constructs the output lines from the list of specified fields, where
the format of each item in the list is filenum.fieldnum; multiple items in the list can
be separated by commas or whitespace
join
is quite useful for constructing data
tables from multiple files, and a sequence of
join
operations can construct a complicated
file. In a simple example, there are three files:
mustelidae.color:
badger black
ermine white
long-tailed tan
otter brown
stoat tan

mustelidae.prey:
ermine mouse
badger mole
stoat vole
otter fish
long-tailed mouse

mustelidae.habitat:
river otter
snowfield ermine
prairie long-tailed
forest badger
plains stoat
First, combine mustelidae.color
and
mustelidae.prey
. The field both have in common
is the name of the animal, which is the first field in each file.
mustelidae.prey
isn't yet sorted. The form
of the join
command needed is:
% sort mustelidae.prey | join mustelidae.color - > outfile
which produces the following output:
badger black mole
ermine white mouse
long-tailed tan mouse
otter brown fish
stoat tan vole
Now combine the resulting file with
mustelidae.habitat
. If you want the resulting
output to be in the form habitat animal prey
color, use the command construct:
% sort -k2 mustelidae.habitat | join -1 2 -2 1 -o 1.1,2.1,2.3,2.2 - outfile
This operates on the standard input and the output file from the previous step to produce the output:
forest badger mole black
snowfield ermine mouse white
prairie long-tailed mouse tan
river otter fish brown
plains stoat vole tan
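The options for unpairable lines are worth a quick look as well. The sketch below uses two small invented files, color.txt and prey.txt, in which some animals appear in only one file; LC_ALL=C is set so that join and the already-sorted input agree on the sort order.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Both files are already sorted on their first field.
printf 'badger black\nermine white\notter brown\nstoat tan\n' > color.txt
printf 'badger mole\notter fish\nweasel vole\n' > prey.txt

# -a 1 keeps lines from color.txt that have no partner in prey.txt;
# -e fills the missing prey field; -o spells out the output fields
# (0 is the join field itself).
LC_ALL=C join -a 1 -e unknown -o 0,1.2,2.2 color.txt prey.txt > merged.txt
cat merged.txt

# -v 2 prints only the lines of prey.txt that found no partner.
LC_ALL=C join -v 2 color.txt prey.txt > unmatched.txt
cat unmatched.txt
```

Here merged.txt contains one line per animal in color.txt, with "unknown" in the prey column for ermine and stoat, and unmatched.txt contains the weasel line.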
Usage: sort -[general options] -o [outfile] -[key interpretation options] -t [char] -k [keydef]... [filenames]
The sort
command can sort a single file, sort a
group of files and simultaneously merge them into a single file, or
check a file for sortedness. This function has many applications in
data processing. Each line in the file is treated as a single field
by default, but keys can also be defined by the user on the command
line.
The main options for sort
are:
-c    Tests a file for sortedness based on the user-selected options
-m    Merges several input files
-u    Displays only one instance of lines that compare as equal
-o outfile    Sends the output to a file instead of sending it to standard output
-t char    Uses the specified character to delimit fields
Options that determine how keys are interpreted can be used as global
options, but they can also be used as flags on a particular key. The
key interpretation options for sort
are:
-b    Ignores leading whitespace when locating the start of a sort key.
-r    Reverses the sort order for a particular key.
-d    Uses "dictionary order" in making comparisons; i.e., characters other than letters, digits, and whitespace are ignored.
-f    Reclassifies lowercase letters as uppercase for the purpose of making
comparisons. Normally, L and l would be separated from each other due
to being in uppercase and lowercase character sets; with the -f flag,
all L's end up together, whether capitalized or not.
Key definitions are arguments of the
-k
option. The form of a key definition is
position1,position2. Each is a numerical value
that specifies where within the line the key starts and ends.
Positions can have the form field.character,
where field
specifies the field position in the
input line, and character
specifies the position
of the starting character of the key within its individual field. If
the key is flagged with one of the key interpretation options, the
form of the key is field.character[flags]. If
the key interpretation option isn't applied to the whole sort,
but merely to one key, then it's appended to the key definition
without a preceding hyphen.
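Key definitions are easier to read in a worked case. In this sketch the file animals.txt and its values are invented; the third field is sorted numerically in reverse, with the first field breaking ties, and the flags n and r are attached to the key without a hyphen as described above.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical colon-delimited table: name:color:mass
printf 'otter:brown:8.5\nbadger:black:12.0\nstoat:tan:0.3\nermine:white:0.3\n' > animals.txt

# Key 1: field 3 only (3,3), numeric (n), reversed (r).
# Key 2: field 1 only (1,1), plain ascending, used to break ties.
LC_ALL=C sort -t : -k 3,3nr -k 1,1 animals.txt > by_mass.txt
cat by_mass.txt

# -c only checks; it reports through its exit status.
if LC_ALL=C sort -c -t : -k 1,1 animals.txt 2>/dev/null; then
    echo "animals.txt is sorted by name"
else
    echo "animals.txt is not sorted by name"
fi
```

Writing 3,3 rather than just 3 matters: a key given as -k 3 runs from field 3 to the end of the line.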
It's frequently useful to find out if two separate files are the same and, if not, where they have differences. For instance, if you have compiled a program on your local machine, and test cases are provided, you should run your copy of the program on the test cases and compare the output to the canonical output provided by the makers of the program. If you want to check that the backup copy of a file and the current version of the file are the same, file-comparison tools are very useful. Unix provides tools that allow you to do this without laboriously searching through the files by hand.
Usage: cmp -[options] file1 file2
Usage: diff -[options] file1 file2
Let's say you have two lists and, while they look similar, you
can't tell by eye if they are exactly the same list. This can
happen if you get a list of gene names back from database searches
performed using two subtly different queries and want to know if they
are equivalent. In order to compare them rigorously (and save your
eyes in the process), you can try the semicomplementary commands
cmp
and diff
. In short,
cmp
tells you whether two files are identical,
and diff
prints any lines that are different.
cmp
is fairly simple-minded. Typing:
% cmp enolase1.list enolase2.list
produces no output if the two files are identical. Otherwise,
cmp
returns a message that the files differ and
includes the character and line at which the first difference occurs.
diff
is most useful for comparing different
versions of a file to find exactly where the files differ. Before
looking at diff's rather cryptic output,
it's worth a moment to see how to decipher it. Without options,
diff
responds with a list of differences in the
form of the changes required to make file2
from
file1
:
x,ydi    Lines x through y in file1 are missing in file2 after line i (i.e., they've been deleted from file2).
iax,y    Lines x through y in file2 are missing in file1 after line i (i.e., they've been added to file2).
i,jcx,y    Lines i through j in file1 have been changed to lines x through y in file2.
In practice, the output looks like this (where
enolase1.list and enolase2.list
are lists of names of putative
enolases produced by two database searches performed at different
times):
% diff enolase1.list enolase2.list
1a2
> ENO_MESCR
5a7
> ENOA_MOUSE
Here are two of the more immediately useful options
diff
uses:
-b    Ignores differences in the amount of whitespace within lines
-B    Ignores inserted or deleted blank lines between files
The info
pages on diff
and
its variants are especially helpful. If you use this utility
extensively, we strongly recommend you give them a look.
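The sketch below shows cmp and diff side by side on two tiny lists. The list contents are made up for the example (only the file names follow the text), and the exit status of each command is used to decide what to print.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two invented gene-name lists; the second has one extra entry.
printf 'ENO_YEAST\nENOA_HUMAN\nENOB_HUMAN\n' > enolase1.list
printf 'ENO_YEAST\nENO_MESCR\nENOA_HUMAN\nENOB_HUMAN\n' > enolase2.list

# cmp -s prints nothing; its exit status says whether the files are identical.
if cmp -s enolase1.list enolase2.list; then
    echo "lists are identical"
else
    echo "lists differ"
fi

# diff exits with status 1 when the files differ, which is not an error here.
diff enolase1.list enolase2.list > changes.txt || true
cat changes.txt
```

The file changes.txt holds the two lines 1a2 and > ENO_MESCR: after line 1 of the first list, line 2 of the second list has been added.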
Usage: wc -[options] filename(s)
wc
is a simple and useful utility for
counting things in text files. Given a text file,
wc
counts the number of lines, words, and bytes
(characters) that it contains. The default setting for
wc
is to count all three entities, so that
typing it at the command prompt returns a line that looks like:
% wc meercat1.txt
27 98 559 meercat1.txt
This output tells you that there are 27 lines, 98 words, and 559
bytes in meercat1.txt
. If you pass multiple
files to wc
, it returns counts both for
individual files and for all of them combined. For example, if you
run wc
on the three meercat files:
% wc meercat1.txt meercat2.txt meercat3.txt
(or, to save time, wc meercat*.txt
, being
appropriately careful using the wildcard), the output looks like:
41 130 905 meercat1.txt
50 124 869 meercat2.txt
10 19 156 meercat3.txt
101 273 1930 total
These are the options for wc
:
-c    Counts only bytes (characters)
-w    Counts only words
-l    Counts only lines
--help    Prints a usage message
--version    Prints the version of wc being used
Unix tools can often be used in combination to collect information
you need. For instance, say you have a list of 1,000 files that need
to be processed, and the output files are all saved together in the
same directory. Instead of trying to count the contents of that
directory by eye using ls, you can use ls -1 dirname | wc -l
to find how many output files have been created so far.
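As a minimal sketch of that counting trick, the following creates a scratch directory called results (a made-up name) with three empty output files and counts them with wc -l.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Pretend three jobs have finished so far.
mkdir results
touch results/run1.out results/run2.out results/run3.out

# ls -1 prints one name per line; wc -l counts the lines.
# tr strips the padding some versions of wc put in front of the number.
count=$(ls -1 results | wc -l | tr -d ' ')
echo "output files so far: $count"
```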
The pattern-matching language known as
regular
expressions
allows you to search for and extract matches and to replace patterns
of characters in files (given the right program). Regular expressions
are used in the vi
and Emacs text-editing programs. Since
much of the data that biologists work with contains patterns, one of
the first skills you need to learn is how to match patterns and
extract them from files.
Regular expressions also are understood by the Perl language interpreter. Knowing how to use regular expressions along with the basic commands of Perl gives you a powerful set of data-processing tools. We'll cover the basics of regular expressions here, and return to them again in Chapter 12.
If you've ever used a wildcard character in a search, you've used a regular expression. Regular expressions are patterns of text to be matched. There are also special characters that can be used in regular expressions to stand for variable patterns, which means you can search for partial or inexact matches. Regular expressions can consist of any combination of explicit text and special characters.
The special characters recognized in basic regular expressions are:
\
The backslash acts as an escape character for a special character
that follows it. If part of the pattern you are searching for is a
dot, you give the regular expression chars\.txt
to find the pattern chars.txt.
.
The dot matches any single character.
*
The behavior of the asterisk in regular expressions is different from its behavior as a shell wildcard. If preceded by a character, it matches zero or more occurrences of that character. If preceded by a character class description, it matches zero or more characters from that set. If preceded by a dot, it matches zero or more arbitrary characters, which is equivalent to its behavior in the shell.
^
The caret at the beginning of a regular expression matches the beginning of a line. Otherwise, it matches itself.
$
The dollar sign at the end of a regular expression matches the end of a line. Otherwise, it matches itself.
[ ]
A group of characters enclosed in square brackets matches any single
character within the brackets. [badger] matches any of (a, b, d, e,
g, r). Within the set, only -, caret, ], and [ are special. All other
characters, including the general special characters, match
themselves. A range of characters in the form [c1-c2] can also be
given; e.g., [0-9] or [A-Z].
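These special characters can be tried out with grep, which is introduced next. The sketch below counts or prints the lines of a small invented file, patterns.txt, that match a handful of basic regular expressions.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Six made-up test lines.
printf 'chars.txt\ncharsXtxt\nbadger\nbader\nbadgggger\n9 lives\n' > patterns.txt

# An unescaped dot matches any character, so two lines match.
any_char=$(grep -c 'chars.txt' patterns.txt)

# An escaped dot matches only a literal dot, so one line matches.
literal_dot=$(grep -c 'chars\.txt' patterns.txt)
echo "unescaped dot: $any_char, escaped dot: $literal_dot"

# g* means zero or more g's: badger, bader, and badgggger all match.
grep 'badg*er' patterns.txt

# ^ anchors to the start of the line; [0-9] is any one digit.
grep '^[0-9]' patterns.txt

# .* is any run of characters; $ anchors to the end of the line.
grep '^b.*r$' patterns.txt
```

The patterns are enclosed in single quotes so that the shell passes characters such as *, $, and \ to grep unchanged.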
Usage: grep -[options] 'pattern' filenames
grep
allows you to search for patterns (in
the form of regular expressions) in a file or a group of files. GNU
grep
(the standard on Linux) searches for one of
three kinds of patterns, depending on which of the following
functions is selected:
-G    Standard grep: searches for a regular expression (this is the default)
-E    Extended grep: searches for an extended regular expression
-F    Fast grep: rapidly searches for a fixed string (a pattern made of normal characters, as opposed to regular expressions)
Note that the -E
and -F
options can be explicitly selected by calling
egrep
or fgrep
on some
systems. If no files are specified to be searched,
grep
searches the standard input for the
pattern, allowing the output of another program to be redirected to
grep
if you are looking for a pattern in the
output.
As a simple example, consider the following commands:
% grep -c '>' SP-caspases-A.fasta SP-caspases-B.fasta
% grep '>' SP-caspases-A.fasta SP-caspases-B.fasta
These both search through a file of FASTA-formatted sequences (whose
header lines, you will remember, begin with the
>
symbol). The first command returns the
number of sequences in each file, while the second returns a list of
the sequence headers. Be sure to enclose the
>
in quotes, though. Otherwise, as one of us
once found out the hard way, the command is interpreted as a request
for grep
to search the standard input for no
pattern and then redirect the resulting empty string to the files
listed, overwriting whatever was already there.
grep
takes dozens of options. Here are some of
the more useful ones:
-c    Prints only a count of matching lines, rather than printing the matching lines themselves
-i    Ignores uppercase/lowercase distinctions in both file and pattern
-n    Prints lines and line numbers for each occurrence of a pattern match
-l    Prints filenames containing matches to pattern, but not matching lines
-h    Prints matching lines but not filenames (the opposite of -l)
-v    Prints only those lines that don't contain a match with pattern
-q    (quiet mode) Prints nothing and stops searching at the first match; the result is reported only through the exit status
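The following sketch exercises several of these options on two tiny FASTA files. The file names follow the earlier example, but the headers and sequences inside them are invented.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up FASTA files: two records in A, one in B.
printf '>sp|A1 caspase-1\nMADKVLKE\n>sp|A2 Caspase-3\nMENTENSV\n' > SP-caspases-A.fasta
printf '>sp|B1 caspase-8\nMDFSRNLY\n' > SP-caspases-B.fasta

# -c: number of header lines (sequences) per file.
grep -c '>' SP-caspases-A.fasta SP-caspases-B.fasta

# -l: only the names of the files that contain a match.
grep -l 'caspase-8' SP-caspases-A.fasta SP-caspases-B.fasta

# -i and -n: case-insensitive match, with line numbers.
grep -i -n 'caspase' SP-caspases-A.fasta

# -v: everything except the header lines, i.e., the sequences.
grep -v '>' SP-caspases-A.fasta

# -q: no output at all; the exit status drives the if statement.
if grep -q 'caspase-9' SP-caspases-A.fasta; then
    echo "caspase-9 found"
else
    echo "caspase-9 not found"
fi
```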
In protein structure files, protein sequence information is stored as a sequence of three-letter codes, rather than in the more compact single-letter code format. It's sometimes necessary to extract sequence information from protein structure files. In real life, you can do this with a simple Perl program and then go on to translate the sequence into single-letter code. But you can also extract the sequence with two simple Unix filter commands.
The first step is to find the SEQRES records in the PDB file. This is
done using the grep
command:
% grep SEQRES pdbfile > seqres
This gives you a file called seqres
containing
records that look like this:
SEQRES   1    357  GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN  2MNR 106
SEQRES   2    357  VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR  2MNR 107
SEQRES   3    357  VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR  2MNR 108
Not all the characters in each record belong to the amino-acid
sequence. Next, you need to extract the sequences from the records.
This can be done using the cut
command:
% cut -c20-70 seqres > seqs
The output of this command, in the file seqs
,
looks like this:
GLU VAL LEU ILE THR GLY LEU ARG THR ARG ALA VAL ASN
VAL PRO LEU ALA TYR PRO VAL HIS THR ALA VAL GLY THR
VAL GLY THR ALA PRO LEU VAL LEU ILE ASP LEU ALA THR
If you don't want to create the intermediate file, you can pipe the commands together into one command line:
% grep SEQRES pdbfile | cut -c20-70 | paste -s - > seqs
Addition of the paste -s
command joins the
individual lines in the file
into
one long line.
The various Unix shells also provide a mechanism for writing multistep scripts that let you automate your work. Scripts are labeled as such because they contain, verbatim, the sequence of commands you want to "say" to the shell, just as the script for a play contains the sequence of lines the author wants the actors to say.
Shell scripts—even the simplest ones—are still applications, and they behave accordingly. Let's say you want to start a series of calculations that will take a while, and then go home to eat dinner. By default, the shell will wait until one command is finished to execute the next command, so if the second command acts upon the output of the first, it won't start prematurely. The important thing is that you don't have to be there to type the second command.
Here's a relatively simple example. Assume you have just
downloaded the entire set of GenBank DNA sequence files. You want the
information in the files, but you need it to be in a different format
so that a program you've downloaded can process it.
You're going to use the program gb2fasta
to convert the files from GenBank to FASTA format. (This script
assumes you've downloaded the GenBank files to your current
working directory.) Then you want to process each file using the
BLAST formatdb
program. To make the script more
flexible, you can write it so that it takes an optional file list on
the command line to specify which files to process. The script might
look like this:
#!/usr/bin/csh
foreach file ($*)
    echo $file
    gb2fasta $file > $file.na
    formatdb -t "$file" -i $file.na -p F
end
After creating the file, you need to make it executable using the
chmod
command. For instance, if the filename of
the script is blastprep
, give the command:
% chmod a+x blastprep
The first line of the script tells the operating system which shell program to use; when you run the script, that shell is invoked and the job is run. You can invoke your command immediately in the following way:
% ./blastprep gbest*.seq
In order to run the new script without giving its full path, you need
to run the rehash
command before typing this
command. rehash
is a C-shell command that
updates the list of all executable files in your path.
In the previous example, all the GenBank EST files are automatically
parsed and prepared for use with BLAST. The programs
gb2fasta
and formatdb
run
just as they do on the command line, but you don't have to wait
for each command to complete. The script takes your command-line
argument—in this case gbest*.seq
, which is
a list of filenames—and sequentially fills the variable
$file
with each value. It then loops through the
lines between the "foreach" and "end" lines.
The echo
command simply sends the value of
$file
to standard output, so you can see in your
terminal window how the job is progressing. The
gb2fasta
program normally prints to standard
output, so you need to redirect the output to a specific filename. On
the other hand, formatdb
processes the input
files and generates new files using an internal naming convention, so
no output file is needed in the script.
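If your shell is sh or bash rather than the C shell, the same loop can be sketched with a for statement. This is only an illustration of the loop structure: process_file is a hypothetical stand-in for the gb2fasta and formatdb steps (it merely lowercases its input), and the two gbest files are created empty of real data so that the example can run anywhere.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two placeholder input files with made-up content.
printf 'LOCUS AAA\n' > gbest1.seq
printf 'LOCUS BBB\n' > gbest2.seq

# Hypothetical stand-in for the real conversion step: writes FILE.na.
process_file() {
    tr 'A-Z' 'a-z' < "$1" > "$1.na"
}

# The shell expands gbest*.seq to a list of filenames; the loop body
# runs once per file, each command finishing before the next begins.
for file in gbest*.seq; do
    echo "$file"
    process_file "$file"
done
```

Quoting "$file" keeps the loop working even if a filename contains spaces.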
As we'll see in Chapter 6, the ability to plug into other computers and networks across the world allows you to read and download an amazing amount of information, as well as share data with your colleagues. In fact, your work as a bioinformatician depends on having access to public databases and other repositories of biological data. In this section, we look at how your computer communicates with other machines and the tools it uses to do so.
The easiest way to communicate with other computers is via the Web. Most distributions of Linux include web browser software—usually Netscape—which, if you select it from the list of installation options, is automatically installed for you. Setting up a web browser on a Linux system is the same as setting up a browser on other computers; you need to set the browser's preferences and tell it where the correct utilities are located to open different kinds of file attachments.
You may want to maintain a web page on your machine, and in order to
do that, you need to install web server software. Again, most Linux
distributions allow you to install the Apache web server software as
one of your installation options. If you choose to install the Apache
web server, you can publish a simple web site by placing the
appropriate HTML files in the /home/httpd/html
directory.
In the world of the Internet, computers recognize each other by their Internet Protocol (IP) addresses. Computers that are constantly connected to the Internet have permanently allocated IP addresses and hostnames, while computers that only connect to the Internet occasionally may have dynamically allocated IP addresses, or no IP address at all, depending on the protocol they use to connect.
IP addresses consist of four numbers separated by dots (e.g.,
128.174.55.33). These are interpreted as directions to the
host (a computer that communicates with other
computers) by network software. Computers also have
hostnames, such as
gibas.biotech.vt.edu
. Name
servers are dedicated machines that maintain information
about the relationships among IP addresses and hostnames.
Usage: telnet full.hostname
The telnet
command opens a shell on a remote Unix
machine; the workstation on which the command is issued becomes a
terminal for that machine. To telnet
to another
Unix machine, you must have a login on that machine. Once
you're logged in to the remote host, the shell works just as if
you were working directly on the remote machine.
A "login:" prompt should appear, followed by a "password:" prompt after your ID is entered.
Usage: ftp full.host.name.edu
The
File Transfer Protocol
(ftp
) is a method for transferring files from
one computer to another. You may be familiar with Fetch, Interarchy,
or other PC-based FTP clients; Unix ftp
is
conceptually similar to these programs (and many of them have analogs
that run under Linux, if you like their graphical user interfaces).
When you use ftp
to connect to another host, you
will find yourself in an operating environment that is unique to
ftp.
Unix commands don't always work in
the ftp
environment, although the commands
ls
and cd
have similar
functions.
Again, a "login:" prompt appears, followed by a
"password:" prompt. If you are accessing an anonymous FTP
server (a common way to distribute software), the standard username
is anonymous
, and your email address is the
password. Once in the FTP environment, the most important commands to
know are:
? or help    Prints out the list of ftp commands. help command prints out information on a specific command.
ls    Lists the contents of the directory on the remote host.
cd    Changes the working directory on the remote host.
lcd    Changes the working directory on the local host.
get, mget    get copies a single file from the remote host to the local host. mget copies multiple files.
put, mput    put copies a single file from the local host to the remote host. mput copies multiple files.
binary, ascii    Changes the file-transfer mode to binary or ASCII. You should choose binary when you are downloading binary executables, images, and other encoded file formats.
prompt    Toggles the interactive mode that asks you to confirm every transfer when you transfer multiple files.
Sometimes you need to run an X program on another
computer and have it display on your terminal. This is relatively
simple to do. First, you need to set your own terminal to allow
remote displays from other hosts. This is done using the
xhost
command:
% xhost +
A confirmation that access is allowed from other hosts is then printed to standard output.
Next, you need to change the display environment on the remote
machine. This is done with the setenv
command:
% setenv DISPLAY yourmachine.yoursubnet.wherever.edu:0
Not all X applications running on a remote server can use your terminal for display, generally because the remote machine and your machine don't have the same graphics capabilities. For instance, programs running on a remote Silicon Graphics machine can't display on your local Linux workstation, because Silicon Graphics uses proprietary graphics libraries that aren't currently available to Linux users. However, even if both machines are compatible, bandwidth limitations can make running large X programs over the network extremely slow.
One of the biggest inconveniences for Linux users in a primarily Mac/PC environment is the sharing of files generated by PC productivity software with other users. While it's not our purpose to teach you to use these packages here, we can mention a few options that will help you handle communication with non-Unix users.
Fortunately, there are relatively low-cost software products
available for Linux that make it possible to work with common file
types, such as Microsoft Word and rich-text format (RTF) documents,
PowerPoint presentations, and Excel spreadsheets. Sun's
StarOffice (http://www.staroffice.com) and Applix's
Applixware (http://www.vistasource.com) are two
possibilities; at the time of this writing, StarOffice seemed to do
the cleanest job of converting files generated by Microsoft Word and
other commonly used programs. Adding one of these packages to your
Linux system will add most of the basic PC functions (word
processing, electronic presentations, etc.) that may be vital to your
work.
Most kinds of graphics files are easily handled and converted on
Linux systems. One powerful tool for manipulating graphics files is
called the GIMP (GNU Image Manipulation Program, http://www.gimp.org).
The GIMP is commonly
included in Linux distributions, so be sure to select it as part of
your installation if you will be doing anything with graphics files.
The GIMP is analogous to the Adobe Photoshop program and shares most of
the same functionality.
Linux users can read and write files on
Microsoft-formatted floppy disks and Zip disks. A floppy or Zip disk
is treated as an additional filesystem on your computer. The most
basic way to access this filesystem is to mount it using the
mount
command. To do this, you need to know the
device ID of the disk you are trying to mount and establish a mount
point for the new filesystem.
Determining the device IDs of the various drives is usually
straightforward. One way is to open the file
/var/log/dmesg
. This file contains the system
information that is printed to standard output when the machine is
booted. Scan through the file and find the drive information, which
should look like this:
hdc: SAMSUNG SC-140B, ATAPI CDROM drive
hdd: IOMEGA ZIP 250 ATAPI, ATAPI FLOPPY drive
hdc: ATAPI 40X CD-ROM drive, 128KB Cache
Floppy drive(s): fd0 is 1.44M
This section of the file contains information about IDE devices. On
this particular machine, the IDE devices include a CD-ROM drive, a
Zip drive, and a floppy drive. The three-letter codes
hdc
, hdd
, and
fd0
are the device IDs.
The next section of the file contains information about SCSI devices.
On this particular machine, the main hard disk is a SCSI drive, and
its ID is sda
. sda1
,
sda2
, etc., are the individual IDs of the
partitions on the hard drive:
Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: hdwr sector= 512 bytes. Sectors= 35566499 [17366 MB] [17.4 GB]
sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 >
Once you know the device IDs, mounting these new filesystems is simple. If you're the root user of your own machine, the command is:
mount -t filesystem_type devicefile mount_point
For example, to mount a PC-formatted floppy disk at /mnt/floppy, the command is:
% mount -t msdos /dev/fd0 /mnt/floppy
You can find a listing of allowed filesystem types in the manpages for mount.
As a shortcut, you can modify your /etc/fstab file to contain the following lines:
/dev/fd0 /mnt/floppy vfat noauto,owner 0 0
/dev/hdd4 /mnt/zip vfat noauto,owner 0 0
On this system, the Zip drive is located at /dev/hdd. All PC-formatted Zip disks use partition number 4, and the device file for that partition is /dev/hdd4. The noauto flag means that these disks aren't mounted automatically at boot time. Once these lines are added to /etc/fstab, the devices can be mounted with the shortened command mount devicefilename.
Once the Zip or floppy is mounted as a partition, the files on that disk can be treated like any other file on the system.
Getting some of these devices working isn't as straightforward as we'd like it to be. For further help, you can search the Web for the Linux how-to pages for the particular device you're using.
If you install the utility package mtools and its graphical frontend mfm, you can run mfm and move files to Zip or floppy disks, using a graphical interface similar to that on a PC. However, if you use this method to access devices, you can't run Unix commands on the files stored on your media until you move them onto the local hard disk.
By default, processes to access media may be run only by the root user. It's possible to configure your system so that other users can write to floppy and Zip drives. However, this creates a security hole in your system. You have to decide for yourself whether the benefits of easy disk access outweigh any potential risks.
Unix environments traditionally have been multiuser environments. While the availability of new flavors of Unix for personal computers might change this on your computer at home, at work you will probably use a shared, networked Unix system at least some of the time. And even on a personal Unix system, you need to be aware of problems that can arise when you create an excessive load on your system, and of how background processes can interfere with your ability to run interactive processes.
Because the Unix operating system can interact with more than one user at a time, from terminals attached directly to the system or over a network, there can be many processes executing on your system. Some processes will be yours, and others will belong to users who may be working across the room from you or hundreds of miles away. To be a good citizen in a Unix environment, you need to share the system's resources. While administrators of large public systems make it nearly impossible for you to be a bad citizen by implementing quotas for space usage and queueing systems for process management, it isn't likely that all systems you use will be so tightly managed. On shared systems in which good faith is all that's keeping users from stepping on each other's toes, it's wise to manage your own processes responsibly. Otherwise someone's going to come gunning for you, and it won't be pretty.
A Unix system carries out many different operations at the same time. Each of these operations, or processes, has a unique process ID that allows the user or the administrator to interact with that process directly.
A minimum number of processes run on a system regardless of whether you actively initiate them. Each shell program, whether idle or active, has a process ID attached to it. Several system (or root) processes, sometimes known as daemons, are constantly active on the system. These processes often lie in wait for you to initiate some activity they handle: for instance, printing files, sending email, or initiating a telnet session.
Above and beyond this minimal system activity level are any processes you initiate by typing a command or running a program. The Unix kernel manages these processes, allocating whatever resources are available to the processes according to their needs.
Each process uses a percentage of the processing capacity of the system's CPU or CPUs. It also uses a percentage of the system's memory. When the processes running on a machine require more than 100% of the CPU's capacity to execute, each individual process will execute more slowly. While Unix does an extremely good job of juggling hundreds of processes that run at the same time without having the machine roll over and die, eventually you will see a situation where the load on the machine increases to the point that the machine becomes useless. The operating system uses many techniques to prevent this, such as limiting the absolute number of processes that can be started and swapping idle jobs out of memory. Even on a single-processor system, it's possible to have multiple processes running concurrently as long as there is enough space for all of the jobs to remain in memory. At the point at which the CPU has to constantly wait for data to get loaded from the swap space on the hard drive, you will see a great drop in efficiency. This can be monitored using the top command, which is described in Section 5.9.1.3. Many machines are more limited by lack of memory than they are by a slow CPU, and it's often more cost-effective to put money into additional RAM than to buy the latest, greatest, and fastest CPU.
Usage: w
The w command is available on most Unix systems. This command can show you which other users are logged into the system and what they are doing. It also shows the current load average on the system.
The standard output of the w command looks like this:
2:55pm up 37 days, 4:50, 4 users, load average: 1.00, 1.02, 2.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
jambeck tty1 22Jan99 37days 3:55m 0.06s startx
jambeck ttyp0 :0.0 Wed 5pm 1:34m 0.22s 0.22s -csh
jambeck ttyp3 :0.0 21Feb99 3:47 9.05s 8.51s telnet weasel
god ttyp2 around 2:52pm 0.00s 0.55s 0.09s create world
The first line of the output is the header. It shows the time of day, how long the machine has been up, how many users are logged in, and what the load average on the system has been for the last 1 minute, 5 minutes, and 15 minutes. The load average is the average number of processes that are running or waiting to run, so it indicates how fully the processors are being used. If you have a single-processor system and a load average of 1, the system is being used at optimal capacity. A four-processor system with a load average of 2 is being used at only half of its capacity. If you log in to a system and it's already being used at or beyond its capacity, it's not polite to add other processes that will start running right away. The batch or at commands can set up a process to start when resources become available.
The information displayed for each user is the username, the tty name, the remote host from which the user is logged in, the login time, the idle time, the JCPU and PCPU times, and what the user is doing.
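To judge whether a machine is busy, it helps to divide the load average by the number of processors. Here is a rough sketch of that arithmetic, using the header line from the sample output above; the processor count of 4 is a made-up value for illustration, and on a live system you would take the header from w or uptime instead.

```shell
# Header line copied from the sample w output above
header='2:55pm up 37 days, 4:50, 4 users, load average: 1.00, 1.02, 2.00'

# Hypothetical processor count, chosen only for illustration
cpus=4

# Keep only the first (1-minute) figure after "load average:"
one_min=$(printf '%s\n' "$header" | sed 's/.*load average: *\([0-9.]*\),.*/\1/')

# The shell has no floating-point arithmetic, so let awk do the division
per_cpu=$(awk -v load="$one_min" -v cpus="$cpus" 'BEGIN { printf "%.2f", load / cpus }')

echo "1-minute load $one_min on $cpus processors: $per_cpu per processor"
```

A per-processor figure well below 1 suggests there is room for your job; a figure at or above 1 suggests you should wait or use a lower priority.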
Usage: ps [options]
ps produces a snapshot of what the processor is doing at the moment you issue the command. Depending on what your computer is doing at the time, typing ps at the prompt should give output along the lines of:
PID TTY TIME CMD
36758 ttyq10 0:02 tcsh
43472 ttyq10 0:00 ps
42948 ttyq10 4:24 xemacs-20
42967 ttyq10 1:21 fermats-last-theorem-solver
Most of ps's options modify the types of processes on which ps reports and the way in which it reports them. Here are some of the more useful options:
Lists every command running on the computer, including those of other users
Produces a long listing of processes (process memory size, user ID, etc.)
Lists processes in a "tree" form, showing related processes
Notice that you don't need to precede these options with a dash. There are actually a couple of dozen options for ps; check info ps to see which options are supported by your local installation.
Usage: top -[options]
The top command provides real-time monitoring of processor activity. It lists processes on the system, sorted by CPU usage, memory usage, or runtime. The top screen looks like this:
4:34pm up 37 days, 6:29, 4 users, load average: 0.25, 0.07, 0.02
42 processes: 39 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 42.9% user, 6.4% system, 0.0% nice, 51.0% idle
Mem: 39092K av, 38332K used, 760K free, 13568K shrd, 212K buff
Swap: 33228K av, 20236K used, 12992K free 8008K cached
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
516 jambeck 15 0 4820 3884 1544 R 0 30.4 9.9 4:23 emacs-fgyell
415 root 9 0 10256 9340 888 R 0 15.5 23.8 161:41 /usr/X11R6/b
10756 cgibas 5 0 716 716 556 R 0 2.3 1.8 0:01 top-ci
The header is similar to the output of w but more detailed. It gives a breakdown of CPU and memory usage in addition to uptime and load averages. The display can be changed to show a variety of fields. The default configuration of top is set in the user's .toprc file or in a systemwide /etc/toprc file.
Here are the top options:
Updates with a frequency of delay
Refreshes without any delay, running at the highest possible priority
Runs in secure mode, with its most potentially dangerous commands disabled
Prints the full command line instead of just the command you're running
Ignores all processes except those currently running
While top is running, certain interactive commands can be entered, unless they are disabled from the command line. The command i toggles the display between showing all processes and showing just the processes currently running. k kills a process. It prompts you for the process ID of the process to kill and the signal to send to it. Signal 15 is a normal kill; signal 9 is a swift and deadly kill that can't be ignored by the process. r changes the running priority of a process, implementing the renice command discussed in Section 5.9.1.5. It prompts you for the process ID and the new priority value for the job.
Usage: kill [-s signal | -p] [-a] PID
The kill command lets you terminate a process abnormally from the command line. While kill can actually send various types of signals to a process, in practice it's most often used in the form kill PID or, if that fails to kill the process, kill -9 PID.
On most systems, kill -l lists the available types of signals that can be sent to a process. It's sometimes useful to know that jobs can be stopped and restarted with kill -s STOP and kill -s CONT. [**]
A PID is usually just the numerical process ID, which you can find with the ps or top commands. It can also be a process name, in which case a group of similarly named processes can be addressed. Another useful form of PID is -n, where n is a process group ID, which allows the kill command to address all the processes in a group simultaneously.
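The stop, restart, and kill sequence can be tried safely on a throwaway job. This is just a sketch using a background sleep as the victim; the variable names are ours, and the exit status that wait reports for a killed job depends on your shell (sh and bash report 128 plus the signal number).

```shell
# Start a harmless background job to practice on
sleep 60 &
pid=$!

kill -s STOP "$pid"   # suspend the job
kill -s CONT "$pid"   # let it resume
kill "$pid"           # default signal 15 (TERM): a normal kill

# wait reports how the job ended; "|| status=$?" captures a nonzero
# status without aborting a script that runs under set -e
status=0
wait "$pid" || status=$?

echo "job $pid ended with status $status"
```

If the normal kill had been ignored, kill -9 "$pid" would be the next step, as described above.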
Usage: nice -n [val command arg]
Usage: renice -n [incr] [-g|-p|-u] id
Processes initiated on a Unix system run at the maximum allowed priority unless you tell them to do otherwise. The nice and renice commands allow the owner of a process, or the superuser, to lower the priority of a job.
If limited computing resources are shared among many users and computers are used simultaneously for computation and interactive work, it's polite to run background jobs (jobs that run on the machine without any interactive interface) with a low priority. Otherwise, interactive jobs such as text editing or graphical-display programs run extremely slowly while background jobs hog the available resources. Jobs running at a low priority are slowed only if higher-priority processes are running. When the load on the system is low, background jobs with low priority expand to use all the available resources.
You can initiate a command at a low priority using nice. The value following -n is the priority value. On most systems, this is set to 10 by default and can range from 1 to 19, or 0 to 20. The larger the number, the lower the priority of the job.
The renice command allows you to reset the priority of a process that's already running. incr is a value to be added to the current priority. Thus, if you have a background process running at normal priority (priority 1) and you want to lower its priority (by increasing the priority number), you can enter renice -n 18 followed by the process ID to increase the priority value to 19. You can also input a negative number to put the job at high priority, but unless you are root, you are limited to raising its priority to 1. The renice options -p, -g, and -u cause renice to interpret id as a process ID, a process group ID, or a user number, respectively.
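You can watch nice change the priority value without running a real job. The sketch below relies on an assumption about your system: that nice, when run with no command (as GNU nice allows), prints the current priority value. Other versions of nice may insist on a command, so treat this as an illustration rather than a portable recipe.

```shell
# With no command, GNU nice prints the current priority value
# (an assumption; check the manpage for your own nice)
base=$(nice)

# Run a second nice 10 steps lower in priority and let it report
# the value it was started with
lowered=$(nice -n 10 nice)

echo "shell runs at priority value $base; the niced command ran at $lowered"
```

In practice, you would replace the inner nice with the long-running program you want to keep out of other users' way.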
The cron daemon, crond, is a standard Unix process that performs recurring jobs for the system and individual users. System activities such as cleanup of the /tmp directory and system backups are typically functions controlled by the cron daemon. Normal users can also submit their own jobs to cron, assuming they have permission to run cron jobs. Details about cron permissions are found in the crontab manpage. Since the at and batch commands, which are discussed later, are also controlled by cron, most systems are configured to allow users to use cron by default.
Usage: crontab -[options] file
Submission of jobs to cron is done using the crontab command. crontab -l > file places the current contents of your crontab into a file so you can edit the list. crontab file sends the newly edited file back and initializes it for use by cron. crontab -r deletes your current crontab.
cron processes the contents of all crontabs and then initiates jobs as scheduled. A crontab entry as produced by crontab -l looks like:
# Format of lines:
#min hour daymo month daywk cmd
50 2 * * * /home/jambeck/runme
This entry runs the program runme at 2:50 A.M. every day. An asterisk in any field means "perform this function every time." In this entry, all output to either STDOUT or STDERR is mailed to user jambeck's email account on the machine where the cron job ran.
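If the field order is hard to remember, it can help to take an entry apart by hand. The following sketch splits the sample entry into the six fields named in the format comment; it only illustrates the layout and is not how cron itself parses the file.

```shell
# The sample entry from above
entry='50 2 * * * /home/jambeck/runme'

# read splits on whitespace; the last variable (cmd) receives the rest
# of the line, so commands that contain spaces stay intact
read -r min hour daymo month daywk cmd <<EOF
$entry
EOF

# Both time fields are plain numbers in this entry, so they can be
# printed as a clock time
summary=$(printf 'runs %s at %02d:%02d' "$cmd" "$hour" "$min")

echo "$summary"
echo "day of month: $daymo, month: $month, day of week: $daywk"
```

Fields that hold an asterisk, a list, or a range can't be printed as numbers this way, so a more general script would have to test each field first.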
What if your group performs DNA sequencing on a daily basis, and you want to use the sequence-alignment program BLAST to compare your sequences automatically against a nonredundant protein database? Consider this crontab entry:
01 4 * * * find /data/seq/ -name "*.seq" -type f -mtime -1 -exec /usr/bin/csh /usr/local/bin/blastall -p blastx -d nr -i '{}' ';'
This automatically runs at 4:01 A.M. and checks for all sequences that have been modified or added to the database in the last 24 hours. It then runs the BLASTX program to search your copy of the nonredundant protein sequence database for matches to your new sequences and mails you the results. This example assumes you have all the necessary environment variables set up correctly so that BLAST can find the necessary scoring matrixes and databases. It also uses a default parameter set, which may need to be modified to get useful results. Once you get it configured correctly, all you have to do is browse through your email while you drink your morning coffee.
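Before trusting a find command like this one to cron, it's worth checking which files it selects. Here is a small sketch that does so in a scratch directory; the filenames are invented for the test, and basename stands in for the BLAST command so that nothing expensive is run.

```shell
# Scratch directory with hypothetical files: one new sequence file,
# one old sequence file, and one file that isn't a sequence at all
workdir=$(mktemp -d)
touch "$workdir/new.seq" "$workdir/notes.txt"
touch -t 200001010000 "$workdir/old.seq"   # backdate to January 1, 2000

# Same tests as the crontab entry: .seq files modified in the last 24 hours.
# '{}' is replaced by each filename and ';' marks the end of the command.
recent=$(find "$workdir" -name "*.seq" -type f -mtime -1 -exec basename '{}' ';')

echo "$recent"
rm -rf "$workdir"
```

Only new.seq passes both the name test and the age test, so it is the only file that would be handed to the real command.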
Usage: at -[options] time
Usage: batch -[options]
The batch and at commands are standard Unix functions and are available on most systems. Jobs are submitted to queues, and the queues are processed by the cron daemon; jobs are governed by the same restrictions as crontab submissions. The batch command assigns priorities to jobs running on the system. Using batch allows a system administrator to sort jobs by priority (high to low), thereby allowing more important jobs to run first. Unless the system has a mechanism to kill interactive jobs that exceed a specified time limit, this use of the batch queue relies on users to work in a cooperative manner. On larger systems the function of batch is usually replaced by more complicated queuing systems. You need to get information from your system administrator about which batch and at queues are available.
at allows you to submit a job to run at some specified time. batch sequentially runs jobs whenever the machine load drops below a specified level and the maximum number of concurrent batch jobs has not been exceeded. Once you initiate at or batch, all command-line entries are considered part of the job until you terminate the submission with a Ctrl-D keystroke. As with cron, any STDOUT and STDERR generated by the job are mailed to you, so you at least get notified of error conditions. Here are the common options:
Specifies the queue. By default, at uses the "a" queue; batch uses the "b" queue.
Causes at to list the jobs currently in the specified queue.
Tells at to delete a specified job.
Instructs at to run the job from a file rather than standard input.
Instructs batch and at to send mail upon completion, even when no output is generated.
Time can be now, teatime, 7:00 P.M., 7:00 P.M. tomorrow, etc. Check the manpage for more details.
As an example, let's say that you want your boss to think you were slaving away at 3:00 A.M. Simply send her mail at 3:07 A.M. Even if you don't plan on being awake, it's no problem. At the at command prompt, just type:
> at 3:07am
Mail -s "big breakthrough" boss@wherever < /home/jambeck/news
<Ctrl-d>
As fast as available disk space on a system expands, users seem to be able to expand their files to fill it. Software takes up more space; output files become larger and more complex; more layers of analysis can be created. Since the infinitely large data-storage medium has yet to be invented, you can still run up against disk-space limitations. So, you need to be able to monitor how much space you are using and, as we'll discuss in Section 5.9.4, to know how to make data archives and store them on appropriate media.
Usage: du -[options] filenames
du reports the number of disk blocks used by the specified file or files. Without a filename, it reports disk usage for all files in the current working directory. The -s flag causes du to report only a single total for each named file or directory, rather than a separate value for the directory and each of its subdirectories.
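The difference between plain du and du -s is easiest to see on a small directory tree. The sketch below builds one in a scratch directory (the directory and file names are made up) and counts the lines each form prints, since the block counts themselves vary from one filesystem to another.

```shell
# Scratch copy of a tiny, hypothetical project directory
workdir=$(mktemp -d)
mkdir -p "$workdir/otter/raw" "$workdir/otter/results"
echo "ATGC" > "$workdir/otter/raw/a.seq"
echo "score 42" > "$workdir/otter/results/hits.txt"

# Plain du prints one line per directory: raw, results, and otter itself
per_dir=$(du "$workdir/otter" | wc -l)

# du -s prints a single summary line for the named directory
summary=$(du -s "$workdir/otter" | wc -l)

echo "du printed $per_dir lines; du -s printed $summary"
rm -rf "$workdir"
```

On a large project directory, the summary form is usually what you want when you are checking your total against a quota.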
Usage: df
df reports free disk space for local and networked filesystems on your computer. df is a useful way to find out which filesystems are mounted on your computer. If a connection to a filesystem you would expect to find is down, that filesystem doesn't appear in the df output. The df output looks like this:
Filesystem Type blocks use avail %use Mounted on
/dev/root xfs 17506496 14113592 3392904 81 /
/dev/xlv/xlv_raid_home xfs 62783488 39506328 23277160 63 /scratch-res1
/dev/dsk/dks0d5s0 xfs 17259688 15000528 2259160 87 /mnt/root-6.4
/dev/dsk/dks12d1s7 xfs 17711872 11773568 5938304 67 /ip410
/mnt/local/jmd/balder: NFS server balder not responding
zeus:/hamr nfs 2205816 703280 1502536 32 /nfs/zeus/hamr
zeus:/hamrscr nfs 4058200 2153480 1904720 54 /nfs/zeus/hamrscr
zeus:/lcascr1 nfs 142241472 103956480 38284992 74 /nfs/zeus/lcascr1
The first column is the actual location of the filesystem. In this case, locations preceded with / are local, and those preceded with a machine name (e.g., zeus:...) are physically part of another machine. The second column shows the filesystem type; for a remote filesystem, this is the protocol used to mount it, that is, to connect it to your computer. The next three columns show how many blocks the filesystem contains in total, how many of those are in use, and how many are available, followed by the percent use of each device. The final column shows the local path to the filesystem.
It's useful to know these things if you are working on a system that is made up of multiple networked machines. From time to time connections are lost, like that to balder in the previous example. You may log in to a machine that can't find your home directory because an NFS connection is down. At these times, it's useful to be able to figure out what the problem is so you can send a concise and helpful email to the system administrator rather than just saying "help! My home directory is missing."
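Because df prints one filesystem per line with the same columns each time, its output is easy to filter. As a sketch, here is how awk could pick out the filesystems that are more than 80% full. It runs on a few lines copied from the sample output above; the column positions match that sample and may differ in the df on your own system.

```shell
# A few lines copied from the sample df output above (header omitted)
df_sample='/dev/root xfs 17506496 14113592 3392904 81 /
/dev/xlv/xlv_raid_home xfs 62783488 39506328 23277160 63 /scratch-res1
/dev/dsk/dks0d5s0 xfs 17259688 15000528 2259160 87 /mnt/root-6.4
zeus:/hamr nfs 2205816 703280 1502536 32 /nfs/zeus/hamr'

# In this layout, column 6 is the percent use and column 7 the mount point
nearly_full=$(printf '%s\n' "$df_sample" | awk '$6 > 80 { print $7 }')

echo "$nearly_full"
```

The 80% threshold is arbitrary; pick whatever level gives you enough warning before a filesystem fills up.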
On some Unix systems, especially those that provide services to many users, system administrators implement disk space quotas for each user. The consequences of exceeding a disk space quota may be obvious. You might find that you're unable to write files or that you are automatically prompted to delete files each time you log in. Or, the consequences may be silent, but very annoying. For instance, if you exceed a quota, you may be able to run a text editor, only to find that it has overwritten the file you were editing with a file of length zero. Or your older files may simply start to be deleted as space is needed by other users.
If you're paying for computer time on a shared system, it's in your interest to find out what the user quota for the system is, for how long you can exceed it, what will happen if you exceed it, and where and how you can archive your files.
The quota command gives basic information about space usage and quota limits on systems with quotas. On most Unix systems, issuing the command quota -v gives space use information even when user disk quotas haven't been exceeded.
So, after months of your time, hundreds of megabytes of files, and several layers of subdirectories, the otter project is finally complete. Time to move on to the next project with a clean slate. But as refreshing as it may sound, you can't just type:
% rm -rf otter/
Other people may need to look back at your findings or use them as a starting point for their own research. At the other extreme, you can't leave your files lying around or laboriously copy them a few at a time to another location. Not every file needs to be accessible at all times; some files are replaced, while others are more conveniently stored elsewhere. This section covers the tools provided by Unix for archiving your data so you don't have to worry about it on a day-to-day basis but can find things later when you need them.
Usage: tar functions [options] [arguments] filenames
After going through all the effort of setting up your filesystem
rationally, it seems like a waste to lose that structure in the
process of storing it away, like hastily packed dishes in an
unexpected cross-country move. Fortunately, there is a Unix command
that lets you work with whole directories of files while retaining
the directory structure.
tar packs a directory and all its component files and subdirectories into a single file, which by convention is given the name of the directory and a .tar extension. The options for tar break down into two types: functions (of which you must choose one) and options. tar is short for "tape archive," since the utility was originally designed to read and write archives stored on magnetic tape. Another common use of tar is to package software in a form that can be easily transferred over the Internet.
To run tar, you must choose one of the following functions:
Creates a new tape archive
Appends the files to an existing archive
Adds files to the archive if they aren't present or are modified
Extracts files from an existing archive
Prints a table of contents of the archive
The options for tar are as follows:
Performs the specified operation on archive, which can either be a device (such as a tape drive or a removable disk) or a tar file
(verbose mode) Prints the name of each file archived or extracted with a character to indicate the function (a for archived; x for extracted)
(whiny mode) Asks for confirmation at every step
Note that neither functions nor options require the hyphen that usually precedes Unix command options.
If you type:
% tar cvf otter.tar otter/
the otter/ directory and all its subdirectories are rolled into a single file called otter.tar. It's good practice to use the v option, so you can see if something is going horribly wrong while the archive is being processed.
If, on the other hand, you want to make an archive of the otter/ directory on the tape drive nftape, you can type:
% tar cvf /dev/nftape otter/
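If you'd like to convince yourself that the directory structure really survives the trip, try a round trip on something small first. The sketch below creates, lists, and extracts an archive in a scratch directory; the file under otter/ is invented for the test, and the archive is a plain file rather than a tape.

```shell
workdir=$(mktemp -d)

# Build a tiny stand-in for the otter/ project (hypothetical file name)
mkdir -p "$workdir/otter/raw"
echo "ATGC" > "$workdir/otter/raw/a.seq"

# c = create, f = the archive name follows. Running tar from the parent
# directory makes the stored paths start with otter/
( cd "$workdir" && tar cf otter.tar otter )

# t = print a table of contents
listing=$(tar tf "$workdir/otter.tar")
echo "$listing"

# x = extract, here into a separate directory so the restored copy
# can be compared with the original
mkdir "$workdir/restore"
( cd "$workdir/restore" && tar xf ../otter.tar )
restored=$(cat "$workdir/restore/otter/raw/a.seq")
echo "restored file contains: $restored"

rm -rf "$workdir"
```

Adding v to any of these commands (cvf, tvf, xvf) makes tar report each file as it goes.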
A couple of warnings about tar are in order. First, before you use tar on your system, you should use which to find out whether the GNU or the standard version is installed. Several of the options mean different things to each version; the ones listed earlier are the same in each version.
Second, the tar file you create will be as large as all the contents of the directory and subdirectories beneath it. This condition has dire implications if your archived directory is large and you have limited disk space, or you need to transfer large amounts of tar'd data. In these cases, you should break down the directory into subdirectories of a more manageable size, and tar those instead.
If you don't have the space on your current filesystem or partition for your files and the archive you are creating to exist simultaneously, or you wish to download a whole archive file and unpack it just to retrieve a few files, you can transfer your archive over the network or even just to another partition using a combination of ftp and tar commands. Sending an archive this way and then extracting it at the destination can be less time-consuming than a cp -r if a large number of files are involved. The ftp program recognizes a form in which a command replaces the input filenames. The command is executed in a subshell on the local machine and operates on files on the local filesystem. The construct is:
ftp command "| command" filename
Inside the ftp program, here's how to send the output of the tar command, enclosed in quotes, into the filename specified as the target on the remote machine:
put "|tar cvBf - *" filename
Here's how to direct the downloaded archive through the tar command, resulting in extraction of only the files in the specified directory within the archive:
get filename.tar "|tar xvf - dirname"
Finally, here's how to list the contents of the remote archive:
get filename.tar "|tar tf -"
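The dash that appears after the f option in these commands tells tar to use standard output or standard input in place of an archive file, and the same trick works without ftp. As a sketch, here is one way to copy a directory tree to another location (another partition, say) by piping one tar into another; the directory names are placeholders created in a scratch area.

```shell
workdir=$(mktemp -d)

# Hypothetical source tree and an empty destination
mkdir -p "$workdir/src/otter/raw" "$workdir/dest"
echo "ATGC" > "$workdir/src/otter/raw/a.seq"

# The first tar writes the archive to standard output ("f -"); the second
# reads it from standard input, so no archive file is ever stored on disk
( cd "$workdir/src" && tar cf - otter ) | ( cd "$workdir/dest" && tar xf - )

copied=$(cat "$workdir/dest/otter/raw/a.seq")
echo "copied file contains: $copied"

rm -rf "$workdir"
```

Because the archive exists only in the pipe, this approach needs no extra disk space beyond the copy itself.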
Usage: compress -[options] filenames
Ultimately, you don't want to be left with large (if more manageable) tar files cluttering up your filesystem. In this situation, data-compression utilities are important, since they allow you to cheat and reduce the amount of space that files take up on your hard disk.
compress is the standard Unix file-compression command. It's the opposite of uncompress, the command used in Chapter 3 to open compressed papers and software. compress adds a .Z to the end of the filename.
Here are the most useful options for compress:
Forces compression even if a compressed version of the file already exists; the main effect is to overwrite the existing compressed file without asking for confirmation
(verbose mode) Prints percentage compression achieved by the file
(recursive mode) If compress is applied to a directory that contains subdirectories, compresses their contents as well as those of the original directory
If you have a text file named stoat.txt and the tar file of the otter/ directory from the last section, and you want to compress both and look at the resulting compression ratio achieved, type:
% compress -v stoat.txt otter.tar
This command produces two files, stoat.txt.Z and otter.tar.Z. The files can be uncompressed using the uncompress command or gzip -d (described next). In case you were wondering, natural languages (the kind humans use) end up with a compression ratio around 60%, and programming languages get around 40%. Try compressing the sequences of some of your favorite proteins to see what sort of ratio you get: the values can be wildly variable, depending on whether there are repeats in the sequence.
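To see the effect of repeats for yourself, you can compress a deliberately repetitive file and compare sizes. The sketch below uses gzip rather than compress, purely because gzip is more likely to be installed on the machine in front of you; the "sequence" is a made-up 16-letter repeat, not real data.

```shell
workdir=$(mktemp -d)

# Write 200 identical lines to make a highly repetitive test file
i=0
while [ "$i" -lt 200 ]; do
    printf 'ATGCATGCATGCATGC\n'
    i=$((i + 1))
done > "$workdir/repeat.seq"

orig_size=$(wc -c < "$workdir/repeat.seq")

# -c writes the compressed data to standard output, leaving the original alone
gzip -c "$workdir/repeat.seq" > "$workdir/repeat.seq.gz"
gz_size=$(wc -c < "$workdir/repeat.seq.gz")

# Decompress to standard output and check that nothing was lost
roundtrip_ok=no
gzip -dc "$workdir/repeat.seq.gz" | cmp -s - "$workdir/repeat.seq" && roundtrip_ok=yes

echo "original: $orig_size bytes, compressed: $gz_size bytes, round trip: $roundtrip_ok"
rm -rf "$workdir"
```

A real protein sequence with few repeats will shrink far less than this artificial file does.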
Usage: gzip -[options] filenames
As usual, in addition to the standard Unix compress, there's a faster and more efficient GNU utility: gzip. gzip behaves in much the same way as compress, except that it gets better compression on average, since it uses a superior algorithm. gzip adds the suffix .gz to a file that it compresses. It emulates the compress options described earlier and adds a few of its own:
[*] GNU tools are distributed and maintained by the GNU Project at the Free Software Foundation. GNU stands for "GNU's Not Unix" and refers to a complete, Unix-like operating system that's built and maintained by the GNU Project (http://www.Gnu.org).
[†] This isn't an imaginary format at all. It's pretty close to the format of the output file from a calculation that we do frequently: computing the pKa values of individual amino acids in a protein.
[‡] See the Bibliography for pointers to complete references on vi.
[§] Giving rise to the old joke, "Why do programmers confuse Christmas and Halloween? Because OCT 31 is DEC 25."
[‖] If you need to transform data in a way that isn't allowed by the standard Unix filters, see Chapter 12, in which we discuss the Perl scripting language. Perl is a very complete and sophisticated language that allows you to produce an infinite variety of specialized filters.
[#] If you are logged in as root, there are certain tasks you can't do from a remote terminal.
[**] Discussion of the other signals can be found in any of the comprehensive Unix references listed in the Bibliography.