Databases allow data to persist beyond the end of our program. The kinds of databases we’re talking about in this chapter are merely simple ones; how to use full-featured database implementations (Oracle, Sybase, Informix, MySQL, and others) is a topic that could fill an entire book, and usually does. The databases in this chapter are simple enough that you don’t need to know about modules to use them.[1]
Every
system that has Perl also has a simple database already available in
the form of DBM files. This lets your program store data for quick
lookup in a file or in a pair of files. When two files are used, one
holds the data and the other holds a table of contents, but you
don’t need to know that in order to use DBM files. We’re
intentionally being a little vague about the exact implementation,
because that will vary depending upon your machine and configuration;
see the AnyDBM_File manpage for more information. Also,
among the downloadable files from the O’Reilly website is a
utility called which_dbm,
which tries to tell you which implementation you’re using, how
many files there are, and what extensions they use, if any.
Some DBM file implementations (we’ll call it “a file,” even though it may be two actual files) have a limit of around 1000 bytes for each key and value in the file. Your actual limit may be larger or smaller than this number, but as long as you aren’t trying to store gigantic text strings in the file, it shouldn’t be a problem. There’s no limit to the number of individual data items in the file, as long as you have enough disk space.
In Perl, we can access the DBM file as a special kind of hash called a DBM hash. This is a powerful concept, as we’ll see.
To associate a DBM database with a DBM hash (that is, to open it),
use the dbmopen function,[2] which looks similar to open, in a way:
dbmopen(%DATA, "my_database", 0644) or die "Cannot create my_database: $!";
The first parameter is the name of a Perl hash. (If this hash already
has values, the values are inaccessible while the DBM file is open.)
This hash becomes connected to the DBM database whose name was given
as the second parameter, often stored on disk as a pair of files with
the extensions .dir and .pag. (The filename as given
in the second parameter shouldn’t include either extension,
though; the extensions will be automatically added as needed.) In
this case, the files might be called
my_database.dir and
my_database.pag.
Any legal hash name may be used as the name of the DBM hash, although uppercase-only hash names are traditional because their resemblance to filehandles reminds us that the hash is connected to a file. The hash name isn’t stored anywhere in the file, so you can call it whatever you’d like.
If the file doesn’t exist, it will be created and given a
permission mode based upon the value in the third
parameter.[3] The number is typically specified in
octal; the frequently used value of 0644
gives
read-only permission to everyone but the owner, who gets read/write
permission. If you’re trying to open an existing file,
you’d probably rather have the dbmopen
fail if the file isn’t found, so just use
undef
as the third parameter.
The return value from dbmopen is true if the
database could be opened or created, and false otherwise, just like
open. You should generally use or die in the same spirit as with open.
The DBM hash typically stays open throughout the program. When the
program terminates, the association is terminated. You can also break
the association in a manner similar to closing a filehandle, by using
dbmclose:
dbmclose(%DATA);
Here’s the beauty of the DBM hash: it works just like the hashes you already understand! To read from the file, look at an element of the hash. To write to the file, store something into the hash. In short, it’s like any other hash, but instead of being stored in memory, it’s stored on disk. And thus, when your program opens it up again, the hash is already stuffed full of the data from the previous invocation.
All of the normal hash operations are available:
$DATA{"fred"} = "bedrock";        # create (or update) an element
delete $DATA{"barney"};           # remove an element of the database
foreach my $key (keys %DATA) {    # step through all values
    print "$key has value of $DATA{$key}\n";
}
That last loop could have a problem, since keys
has to traverse the entire hash, possibly producing a very large list
of keys. If you are scanning through a DBM hash, it’s generally
more memory-efficient to use the
each
function:
while (my($key, $value) = each(%DATA)) {
    print "$key has value of $value\n";
}
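To see the persistence for ourselves, here is a minimal sketch that opens a DBM hash, stores two items, closes it, and then opens the same database again under a different hash name. The database name and the stored values are made up for the illustration, and File::Temp (a module that ships with Perl) is an assumption used only to get a scratch directory that cleans up after itself; it has nothing to do with DBM files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: only used to get a scratch directory

my $dir = tempdir(CLEANUP => 1);
my $db  = "$dir/my_database";  # hypothetical database name, no extension

# First "run": create the database and store two items.
my %DATA;
dbmopen(%DATA, $db, 0644) or die "Cannot create $db: $!";
$DATA{"fred"}   = "bedrock";
$DATA{"barney"} = "rubble";
dbmclose(%DATA);

# Second "run": open it again. The file already exists, so the mode
# matters only if the file had to be created.
my %AGAIN;
dbmopen(%AGAIN, $db, 0644) or die "Cannot open $db: $!";
my %seen;                      # ordinary in-memory copy, for checking
while (my($key, $value) = each(%AGAIN)) {
    print "$key has value of $value\n";
    $seen{$key} = $value;
}
dbmclose(%AGAIN);
```

The order in which each returns the items is up to the DBM implementation, just as with an ordinary hash, so don't count on fred coming out first.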
If you are accessing DBM files that are maintained by
C programs, you should be aware that
C programs generally tack on a trailing NUL ("\0") character to
the end of their strings, for reasons known only to Kernighan and
Ritchie.[4] The DBM library routines do not need this
NUL (they handle binary data using a byte count, not a NUL-terminated
string), and so the NUL is stored as part of the data.
To cooperate with these programs, you must therefore append a NUL
character to the end of your keys and values, and discard the NUL
from the end of the returned values to have the data make sense. For
example, to look up merlyn
in the sendmail aliases
database on a Unix system, you might do something like this:
dbmopen(my %ALI, "/etc/aliases", undef) or die "no aliases?";
my $value = $ALI{"merlyn\0"};                     # note appended NUL
$value =~ s/\0$//;                                # remove trailing NUL
print "Randal's mail is headed for: $value\n";    # show result
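Not every system has a DBM-format aliases file to experiment with, so here is a small sketch of the same idea against a scratch database. We store one entry the way a C program would (NUL on the end of both key and value), and then show that the lookup only succeeds when the NUL is part of the key. The database name and the stored address are invented for the example, and File::Temp is assumed only as a convenient source of a scratch directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: only used to get a scratch directory

my $dir = tempdir(CLEANUP => 1);
my %ALI;
dbmopen(%ALI, "$dir/aliases_demo", 0644)    # hypothetical database name
    or die "Cannot create scratch database: $!";

# Store an entry as a C program would: NUL terminators included.
$ALI{"merlyn\0"} = "randal\0";

my $without_nul = $ALI{"merlyn"};     # not found: the stored key ends in NUL
my $value       = $ALI{"merlyn\0"};   # found
$value =~ s/\0$//;                    # remove trailing NUL

print "Lookup without NUL found nothing\n" unless defined $without_nul;
print "Randal's mail is headed for: $value\n";
dbmclose(%ALI);
```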
If your DBM files may be concurrently accessed by more than one process (for example if they’re being updated over the Web), you’ll generally need to use an auxiliary lock file. The details of this are beyond the scope of this book; see The Perl Cookbook by Tom Christiansen and Nathan Torkington (O’Reilly & Associates, Inc.).
When storing
data into a DBM file (or
in one of the other types of databases we’ll see in this
chapter), you may need to store more than one item under a single
key. And sometimes you’ll need to be able to prepare some
information to be sent over a network connection or to a system-level
function, or to decode it upon arrival. That’s why Perl has the
pack
and unpack
functions.
The pack
function takes a
format string
and a list of arguments and packs the arguments together to make a
string. Here, we can pack three numbers of varying sizes into a
seven-byte string using the formats c, s, and l (these might remind
some folks of the words “char”, “short”, and
“long”). The first number gets packed into one byte, the
second into two bytes, and the third into four bytes, which explains
why we say this is a seven-byte string:
my $buffer = pack("c s l", 31, 4159, 265359);
When you want the original list of items back, you can use the same
format string with the
unpack
function:
my($char, $short, $long) = unpack("c s l", $buffer);
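If you'd like to check the seven-byte claim rather than take our word for it, a short sketch like this one packs the three numbers, asks for the length of the result, and unpacks them again. Nothing here is new except the use of length on the packed string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One byte, two bytes, and four bytes: seven in all.
my $buffer = pack("c s l", 31, 4159, 265359);
my $size   = length($buffer);
print "packed into $size bytes\n";           # packed into 7 bytes

# The same format string recovers the original list.
my($char, $short, $long) = unpack("c s l", $buffer);
print "$char $short $long\n";                # 31 4159 265359
```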
There are many different format letters available; some of these are
the same on every machine (so they’re useful for sending data
over a network), while others depend upon how your machine likes to
work with data (these are useful for interacting with your
system’s own data). See the
perlfunc
manpage for the latest list of format
letters, as new ones are being added in every new version of Perl.
Whitespace may be used at will in a
format string to improve readability, as we did in the previous
example. For most format letters, you can follow the format letter
with a number to indicate a repeat count; that is, a format of
"ccccccc" may be written more compactly as "c7". Instead of a number, you may follow the last
format letter with a star (*), which means to use
that format as many times as needed to use up the remaining items in
the list (in pack) or to use up the rest of the
string (in unpack). So a format of "c*" will either unpack a string into a list of
small integers, or pack up those small integers to make a string. For
some format letters, such as a, the number is not
a repeat count; "a20" is a twenty-character ASCII
string, padded with NUL characters as needed.
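Here is a brief sketch of those three ideas (a repeat count, the star, and a count that is really a field width). The particular numbers and the name are arbitrary; the tr operator is used only as a quick way to count the NUL characters in the padded string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A repeat count is just shorthand for repeating the letter.
my $same = pack("c3", 1, 2, 3) eq pack("ccc", 1, 2, 3);

# The star uses up whatever is left, in either direction.
my $packed = pack("c*", 72, 105);      # two small integers become "Hi"
my @codes  = unpack("c*", "Hi");       # and "Hi" becomes (72, 105)

# With "a", the number is a width: always 20 characters, NUL-padded.
my $name    = pack("a20", "fred");
my $width   = length($name);           # 20
my $padding = ($name =~ tr/\0//);      # counts the NULs: 16

print "$packed\n";                     # Hi
print "@codes\n";                      # 72 105
print "$width characters, $padding of them NUL\n";
```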
Another form of persistent data is the fixed-length, record-oriented disk file.[5] In this scheme, the data consists of a number of records of identical length. The numbering of the records is either not important or determined by some indexing scheme.
For example, we might want to store some information about each bowler at Bedrock Lanes. Let’s say we decide to have a series of records, one per bowler, in which the data holds the player’s name, age, last five bowling scores, and the time and date of his last game.
We need to decide upon a suitable format for this data. Let’s
say that after studying the available formats in the documentation
for pack, we decide to use 40 characters for the
player’s name, a one-byte integer for his age,[6] five
two-byte integers for his last five scores,[7] and a four-byte integer for the
timestamp of his most-recent game,[8] giving a format string of "a40 C S5 L". Each record is thus 40 + 1 + (5 x 2) + 4 = 55 bytes long. If we were
reading all of the data in the database, we’d read chunks of 55
bytes until we got to the end. If we wanted to go to the fifth
record, we’d skip ahead 4 x 55 bytes (220 bytes) and read
the fifth record directly.
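The arithmetic is easy to get wrong by hand, so here is a small sketch that lets pack do the counting. It assumes the scores are packed as two-byte unsigned integers (the S format letter), which is what makes the record come out to 55 bytes; the variable names are ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout: 40-char name, 1-byte age, five 2-byte scores, 4-byte timestamp.
my $pack_format = "a40 C S5 L";

# Pack a dummy record and measure it, rather than counting bytes by hand.
my $record_length = length pack($pack_format, "", 0, (0) x 5, 0);

# The fifth record (counting from one) starts after four whole records.
my $record_number = 5;
my $offset = ($record_number - 1) * $record_length;

print "record length: $record_length bytes\n";               # 55
print "record $record_number starts at byte $offset\n";      # 220
```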
Perl supports programs that use such a disk file. In order to do so, however, you need to learn a few more things, including how to:
Open a disk file for both reading and writing
Move around in this file to an arbitrary position
Fetch data by a length rather than up to the next newline
Write data down in fixed-length blocks
The open
function has an
additional mode we haven’t shown yet. If you use
"+<"
at the front of the filename
parameter’s string, that is similar to using
"<"
to open the existing file for reading,
except that it also asks for write permission on the file. Thus you
can have read/write access to the file:
open(FRED, "<fred");     # open file fred for reading (error if file absent)
open(FRED, "+<fred");    # open file fred read/write (error if file absent)
Similarly, "+>"
says to create a new file (as
">"
would), but to have read access to it as
well, thus also giving read/write access:
open(WILMA, ">wilma");     # make new file wilma (wiping out existing file)
open(WILMA, "+>wilma");    # make new file wilma, but also with read access
Do you see the important difference between the two new modes? Both
give read/write access to a file. But "+<"
lets
you work with an existing file; it doesn’t create it. The
second mode, "+>"
isn’t often useful,
because it gives read/write access to a new, empty file that it has
just created. That’s mostly used for temporary (scratch) files.
Once we’ve got the file open, we need to move around in it. You
do this with the seek
function:
seek(FRED, 55 * $n, 0); # seek to start of record $n
The first parameter to seek
is a filehandle, the
second parameter gives the offset in bytes from the start of the
file, and the third parameter is zero.[9] To get to a certain record in our file of bowling data,
you’ll need to skip over some other records. Since each record
is 55 bytes long, we’ll multiply $n
times
55
to find out which byte position we want. (Note
that the record numbers are thus zero-based; record zero is at the
beginning of the file.)
Once the file pointer has been positioned with
seek, the next input or output operation will
start at that position.
When we’re ready to read from the file, we can’t use the
ordinary line-input operator because that’s made to read lines,
not 55-byte records. There may not be a newline character in this
entire file, or it may appear in packed data in the middle of a
record. Instead, we’ll use the
read
function:
my $buf;    # The input buffer variable
my $number_read = read(FRED, $buf, 55);
As you can see, the first parameter to read
is
the filehandle. The second parameter is a buffer variable; the data
read will be placed into this variable. (Yes, this is an odd way to
get the result.) The third parameter is the number of bytes to read;
here we’ve asked for 55 bytes, since that’s the size of
our record. Normally, you can expect the length of
$buf to be the specified number of bytes, and you
can expect the return value (in $number_read)
to be the same. But if your current position in the file is only five
bytes from the end when you request 55 bytes, you’ll get only
five. Under normal circumstances, you’ll get as many bytes as
you ask for.
Once you’ve got those 55 bytes, what can you do with them? You can unpack them (using the format we previously designed) to get the bowler’s name and other information, of course:
my($name, $age, $score_1, $score_2, $score_3, $score_4, $score_5, $when) =
    unpack "a40 C S5 L", $buf;
Since we can read the information from the file with
read, can you guess how we can write it back
into the file? Sorry, it’s not write; that
was a trick question.[10] You already know the correct function, which is
print. But you have to be sure that the data
string is exactly the right size; if it’s too large,
you’ll overwrite the next record’s data, but if
it’s too small, leftover data in the current record may be
mixed with the new data. To ensure that the length is correct,
we’ll use pack. Let’s say that Wilma
has just bowled a game and her new score is in
$new_score. That will be the first of the five
most-recent scores we keep for her ($score_5, as
the oldest one, will be discarded), and in place of
$when (the timestamp of her previous game),
we’ll store the current time from the time
function:
print FRED pack("a40 C S5 L", $name, $age, $new_score,
    $score_1, $score_2, $score_3, $score_4, time);
On some systems, you’ll have to use seek
whenever you switch from reading to writing, even if the current
position in the file is already correct. It’s not a bad idea,
then, to always use seek
right before reading or
printing.
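Putting the pieces of this section together, here is one way the whole read-update-write cycle might look. This is a sketch under a few assumptions: the scores are packed as two-byte S values (so the record is 55 bytes), the file name, the bowlers' ages, and all of the scores are invented, and File::Temp is used only to get a scratch directory. Note that unpack with the a format hands back the name with its NUL padding still attached, which is harmless when we pack it again but needs stripping before we print it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: only used to get a scratch directory

my $format = "a40 C S5 L";                                # assumed record layout
my $length = length pack($format, "", 0, (0) x 5, 0);     # 55 bytes

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/bowlers.dat";                            # hypothetical file name

# "+>" makes a new, empty file; we fill it with three made-up records.
open(BOWL, "+>$file") or die "Cannot create $file: $!";
binmode(BOWL);
print BOWL pack($format, "fred",   30, 205, 187, 200, 168, 150, 1000000000);
print BOWL pack($format, "wilma",  28, 230, 215, 198, 201, 176, 1000000000);
print BOWL pack($format, "barney", 29, 140, 133, 162, 120, 155, 1000000000);
close(BOWL);

# "+<" opens the existing file for both reading and writing.
open(BOWL, "+<$file") or die "Cannot open $file: $!";
binmode(BOWL);

my $n = 1;                                   # wilma's record (zero-based)
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
my $buf;
my $number_read = read(BOWL, $buf, $length);
die "Short read" unless defined $number_read and $number_read == $length;

my($name, $age, @scores) = unpack($format, $buf);
my $when = pop @scores;                      # last item is the old timestamp
my $new_score = 244;
pop @scores;                                 # discard the oldest score
unshift @scores, $new_score;                 # newest score goes first

# Reading moved the file pointer, so seek back before printing.
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
print BOWL pack($format, $name, $age, @scores, time);

# Read the record once more to confirm what is now on disk.
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
read(BOWL, $buf, $length);
my($check_name, $check_age, @check_scores) = unpack($format, $buf);
pop @check_scores;                           # drop the timestamp
(my $display = $check_name) =~ s/\0+$//;     # strip the NUL padding for printing
close(BOWL);

my $file_size = -s $file;
print "$display: @check_scores\n";           # wilma: 244 230 215 198 201
print "file is $file_size bytes\n";          # file is 165 bytes
```

Because every record is rewritten at exactly its old length, the file stays at three records (165 bytes) no matter how many times Wilma bowls.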
Rather than use the two constant values "a40 C S5 L" and 55 throughout the program, as
we’ve done here, it would generally be better to define them
just once near the top of the code. That way, if we ever need to
change the database format, we don’t have to go searching
through our code for places where the number 55
appears. Here’s one way you might define both of those values,
using the length
function to determine the length of a
string so you won’t have to count bytes:
my $pack_format = "a40 C S5 L";
my $pack_length = length pack($pack_format, "dummy data", 0, 1, 2, 3, 4, 5, 6);
Many simple databases are merely text files written in a format that allows a program to read and maintain them. For example, a configuration file for some program might be a text file, with one configuration parameter being set on each line. Or maybe the file is a mailing list, with one name and address on each line (probably with the components of the name and address separated by tab characters).
Updating text files is more difficult than it probably seems at first. But that’s only because we’re used to seeing text files rendered as pages (or screens) of text. If you could see the file as it is written in the filesystem, the difficulty is more apparent. Since we can’t show you the file as it’s actually written without opening up a disk drive, here’s our rendition of a piece of a text file[11]:
He had bought a large map representing the sea,\nWithout the l
east vestige of land:\nAnd the crew were much pleased when they
 found it to be\nA map they could all understand.\n"What's th
e good of Mercator's North Poles and Equators,\nTropics, Zones
, and Meridian Lines?"\nSo the Bellman would cry: and the crew w
ould reply\n"They are merely conventional signs!\n"Other map
s are such shapes, with their islands and capes!\nBut we've go
t our brave Captain to thank:"\n(So the crew would protest) "tha
t he's bought us the best-\nA perfect and absolute blank!"\n
If you had this file open in your text editor, it would be easy to change a word, add a comma, or fix a misspelling. If your editor is powerful enough, in fact, you could change the indentation of each line with a single command. But the text file is a stream of bytes; if you wanted to add even a single comma, the remainder of the text file (possibly thousands or millions of bytes) would have to move over to make room. Nearly every tiny change would mean lots of slow copying operations on the file. So how can we edit the file efficiently?
The most common way of programmatically updating a text file is by writing an entirely new file that looks similar to the old one, but making whatever changes we need as we go along. As you’ll see, this technique gives nearly the same result as updating the file itself, but it has some beneficial side effects as well.
In this example, we’ve got hundreds of files with a similar format. One of them is fred03.dat, and it’s full of lines like these:
Program name: granite
Author: Gilbert Bates
Company: RockSoft
Department: R&D
Phone: +1 503 555-0095
Date: Tues March 9, 1999
Version: 2.1
Size: 21k
Status: Final beta
We need to fix this file so that it has some different information. Here’s roughly what this one should look like when we’re done:
Program name: granite
Author: Randal L. Schwartz
Company: RockSoft
Department: R&D
Date: June 12, 2002 6:38 pm
Version: 2.1
Size: 21k
Status: Final beta
In short, we need to make three changes. The name of the
Author
should be changed; the
Date
should be updated to today’s date, and
the Phone
should be removed completely. And we
have to make these changes in hundreds of similar files as well.
Perl supports a way of in-place editing of
files with a little extra help from
the diamond operator ("<>"). Here’s a program
to do what we want, although it may not be obvious how it works at
first. This program’s only new feature is the special variable
$^I; ignore that for now, and we’ll come
back to it:
#!/usr/bin/perl -w
use strict;

chomp(my $date = `date`);
@ARGV = glob "fred*.dat" or die "no files found";
$^I = ".bak";

while (<>) {
    s/^Author:.*/Author: Randal L. Schwartz/;
    s/^Phone:.*\n//;
    s/^Date:.*/Date: $date/;
    print;
}
Since we need today’s date, the program starts by using the
system date
command. A better way to get the date
(in a slightly different format) would almost surely be to use
Perl’s own
localtime
function in a scalar context:
my $date = localtime;
To get the list of files for the diamond operator, we read them from
a glob. The next line sets $^I, but keep ignoring
that for the moment.
The main loop reads, updates, and prints one line at a time. (With
what you know so far, that means that all of the files’ newly
modified contents will be dumped to your terminal, scrolling
furiously past your eyes, without the files being changed at all. But
stick with us.) Note that the second substitution can replace the
entire line containing the phone number with an empty
string—leaving not even a newline—so when that’s
printed, nothing comes out, and it’s as if the
Phone
never existed. Most input lines won’t
match any of the three patterns, and those will be unchanged in the
output.
So this result is close to what we want, except that we haven’t
shown you how the updated information gets back out on to the disk.
The answer is in the variable
$^I. By default it’s undef, and everything is normal. But when
it’s set to some string, it makes the diamond operator
("<>") even more magical than
usual.
We already know about much of the diamond’s magic—it will
automatically open and close a series of files for you, or read from
the standard-input stream if there aren’t any filenames given.
But when there’s a string in $^I, that
string is used as a backup filename’s extension. Let’s
see that in action.
Let’s say it’s time for the diamond to open our file fred03.dat. It opens it like before, but now it renames it, calling it fred03.dat.bak.[12] We’ve still got the same file open, but now it has a different name on the disk. Next, the diamond creates a new file and gives it the name fred03.dat. That’s okay; we weren’t using that name any more. And now the diamond selects the new file as the default for output, so that anything that we print will go into that file.[13]
So now the while
loop will read a line from the
old file, update that, and print it out to the new file. This program
can update hundreds of files in a few seconds on a typical machine.
Pretty powerful, huh?
Once the program has finished, what does the user see? The user says, “Ah, I see what happened! Perl edited my file fred03.dat, making the changes I needed, and saved me a copy of the original in the backup file fred03.dat.bak just to be helpful!” But we now know the truth: Perl didn’t really edit any file. It made a modified copy, said “Abracadabra!”, and switched the files around while we were watching sparks come out of the magic wand. Tricky.
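To take some of the mystery out of the trick, here is roughly the same switch done by hand for a single file, without $^I. This is only a sketch of the idea, not a description of what Perl does internally (for one thing, we rename before opening, and we don't bother copying permissions). The file contents are a cut-down version of the earlier example, and File::Temp is assumed only as a source of a scratch directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: only used to get a scratch directory

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fred03.dat";

# Make a small scratch file to edit.
open(OUT, ">$file") or die "Cannot create $file: $!";
print OUT "Program name: granite\n";
print OUT "Author: Gilbert Bates\n";
print OUT "Phone: +1 503 555-0095\n";
print OUT "Version: 2.1\n";
close(OUT);

# Step 1: the original gets the backup name.
my $backup = "$file.bak";
rename($file, $backup) or die "Cannot rename $file: $!";

# Step 2: read from the backup, write a new file under the old name.
open(OLD, "<$backup") or die "Cannot open $backup: $!";
open(NEW, ">$file")   or die "Cannot create $file: $!";
while (<OLD>) {
    s/^Author:.*/Author: Randal L. Schwartz/;
    s/^Phone:.*\n//;           # whole line vanishes, newline and all
    print NEW $_;
}
close(OLD);
close(NEW);

# What the user sees afterwards: an "edited" file plus a backup.
open(CHECK, "<$file") or die "Cannot open $file: $!";
my @lines = <CHECK>;
close(CHECK);
my $backup_exists = -e $backup ? 1 : 0;
print @lines;
print "backup kept\n" if $backup_exists;
```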
Some folks use a tilde ("~") as the value for
$^I, since that resembles what
emacs does for backup files. Another possible
value for $^I is the empty string. This enables
in-place editing, but doesn’t save the original data in a
backup file. But since a small typo in your pattern could wipe out
all of the old data, using the empty string is recommended only if
you want to find out how good your backup tapes are. It’s easy
enough to delete the backup files when you’re done. And when
something goes wrong and you need to rename the backup files to their
original names, you’ll be glad that you know how to use Perl to
do that (see the multiple-file rename example in Chapter 13).
A program like the example from the previous section is fairly easy to write. But Larry decided it wasn’t easy enough.
Imagine that you need to update hundreds of files that have the
misspelling Randall instead of the
one-l name Randal. You could
write a program like the one in the previous section. Or you could do
it all with a one-line program, right on the command line:
$ perl -p -i.bak -w -e 's/Randall/Randal/g' fred*.dat
Perl has a whole slew of command-line options that can be used to build a complete program in a few keystrokes.[14] Let’s see what these few do.
Starting the command with perl
does something like
putting #!/usr/bin/perl
at the top of a file does:
it says to use the program perl to process what
follows.
The -p
option tells Perl to write a program for
you. It’s not much of a program, though; it looks something
like this:[15]
while (<>) {
    print;
}
If you want even less, you could use
-n
instead; that leaves out the
print
statement. (Fans of awk
will recognize -p and -n.)
Again, it’s not much of a program, but it’s pretty good
for the price of a few keystrokes.
The next option is -i.bak, which you might have guessed sets
$^I to ".bak" before the
program starts. If you don’t want a backup file, you can use
-i alone, with no extension.
We’ve seen -w
before—it turns on warnings.
The -e
option says “executable code
follows.” That means that the
s/Randall/Randal/g
string is treated as Perl code.
Since we’ve already got a while loop (from
the -p option), this code is put inside the loop,
before the print. For technical reasons, the last
semicolon in the -e code is optional. But if you
have more than one -e, and thus more than one
chunk of code, only the semicolon at the end of the last one may
safely be omitted.
The last command-line parameter is fred*.dat,
which says that @ARGV should hold the list of
filenames that match that glob. Put the pieces all together, and
it’s as if we had written a program like this:
#!/usr/bin/perl -w

@ARGV = glob "fred*.dat";
$^I = ".bak";

while (<>) {
    s/Randall/Randal/g;
    print;
}
Compare this program to the one we used in the previous section. It’s pretty similar. These command-line options are pretty handy, aren’t they?
These exercises are all related; it may be helpful to see what the second and third should do before starting on the first. See Section A.15 for answers.
[15] Make a program that will read through the
perlfunc.pod file looking for identifier names
on =item
lines (as in the similar exercise at the
end of Chapter 9). The program should write a
database showing the first line number on which
each identifier appears. That is, if fred
was
mentioned on lines 23, 29, and 54, the value stored under the key
fred
would be 23. (Hint: the special
$.
variable gives the line number of the line that
was just read.)
[10] Make a program that will take a Perl function name on the
command line, and report what =item
line of the
perlfunc.pod file first mentions that function.
Your program should not have to read through a long file to get this
answer. What should your program do if the function name isn’t
found?
[10] (Extra credit exercise.) Modify the program from the previous
exercise so that when the function is found in the database, your
program will launch your favorite pager program to view the
perlfunc.pod file at that line. (Hint: many
programs that can be used for viewing text files work like
less does, with a command line like less
+1234
filename
to start viewing
the file at line 1234. Your favorite text editor may also support
this convention, which is also used by more,
pico, vi,
emacs, and view.)
[1] To be sure, on some of these, the core of Perl will load a module for you. But you don’t need to know anything about modules to use these databases.
[2] Here we
depart from other beginner documentation, which claims that
dbmopen
is deprecated and suggests that you use
the more complicated tie
interface instead. We disagree, since
dbmopen
works just fine, and it keeps you from
having to think harder about what you’re doing. Keep the common
tasks simple!
[3] The actual mode will be modified by the
umask; see the perlfunc manpage for more
information.
[4] Well, they’re not the only ones: it’s because C uses the NUL byte as the end-of-string marker.
[5] By “fixed-length,” we don’t mean that the file itself is of a fixed length; it’s each individual record that is of a fixed length. In this section, we’ll use an example file in which every record is 55 bytes long.
[6] Since one byte may have 256 different values, this will hold ages from 0 to 255 with ease. If Methuselah comes to bowl in Bedrock, we’ll have to redesign the database.
[7] We can’t use one-byte integers for the scores, because a bowling score can be as high as 300. Two-byte integers can hold values from 0 to 65535 (if unsigned) or -32768 to 32767 (if signed). We can use some of these extra values as special codes; for example, if a player has only three games on record, the other scores could be set to 9999 to indicate this.
[8] The standard Unix timestamp format (and the time value used by many other systems) is a 32-bit integer, which fits into four bytes, of course. You’ll probably find it handy to use a module to manipulate time and date formats.
[9] Actually, the third parameter is the “whence” parameter. You can use a different value than zero if you want to seek to a position relative to the current position, or relative to the end of the file; see the perlfunc manpage for more information. Most people will simply want to use zero here.
[10] Perl actually does have a
write function, but that is used with formats,
which are beyond the scope of this book. See the
perlform manpage.
[11] Of course, the real file wouldn’t have lines at all; it’s one long stream of text. And the newline character should really be a single-character code. But these differences don’t hurt this as an example.
[12] Some of the details of this procedure will vary on non-Unix systems, but the end result should be nearly the same. See the release notes for your port of Perl.
[13] The diamond also tries to duplicate the original file’s permission and ownership settings as much as possible; for example, if the old one was world-readable, the new one should be, as well.
[14] See the perlrun manpage for the complete list.
[15] Actually, the print
occurs in a continue block. See the
perlsyn and perlrun manpages for more
information.