Chapter 16. Simple Databases

Databases let data persist beyond the end of our program. The kinds of databases we’re talking about in this chapter are merely simple ones; how to use full-featured database implementations (Oracle, Sybase, Informix, MySQL, and others) is a topic that could fill an entire book, and usually does. The databases in this chapter are those that are simple enough to implement that you don’t need to know about modules to use them.[1]

DBM Files and DBM Hashes

Every system that has Perl also has a simple database already available in the form of DBM files. This lets your program store data for quick lookup in a file or in a pair of files. When two files are used, one holds the data and the other holds a table of contents, but you don’t need to know that in order to use DBM files. We’re intentionally being a little vague about the exact implementation, because that will vary depending upon your machine and configuration; see the AnyDBM_File manpage for more information. Also, among the downloadable files from the O’Reilly website is a utility called which_dbm, which tries to tell you which implementation you’re using, how many files there are, and what extensions they use, if any.

Some DBM file implementations (we’ll call it “a file,” even though it may be two actual files) have a limit of around 1000 bytes for each key and value in the file. Your actual limit may be larger or smaller than this number, but as long as you aren’t trying to store gigantic text strings in the file, it shouldn’t be a problem. There’s no limit to the number of individual data items in the file, as long as you have enough disk space.

In Perl, we can access the DBM file as a special kind of hash called a DBM hash. This is a powerful concept, as we’ll see.

Opening and Closing DBM Hashes

To associate a DBM database with a DBM hash (that is, to open it), use the dbmopen function,[2] which looks similar to open, in a way:

dbmopen(%DATA, "my_database", 0644)
  or die "Cannot create my_database: $!";

The first parameter is the name of a Perl hash. (If this hash already has values, the values are inaccessible while the DBM file is open.) This hash becomes connected to the DBM database whose name was given as the second parameter, often stored on disk as a pair of files with the extensions .dir and .pag. (The filename as given in the second parameter shouldn’t include either extension, though; the extensions will be automatically added as needed.) In this case, the files might be called my_database.dir and my_database.pag.

Any legal hash name may be used as the name of the DBM hash, although uppercase-only hash names are traditional because their resemblance to filehandles reminds us that the hash is connected to a file. The hash name isn’t stored anywhere in the file, so you can call it whatever you’d like.

If the file doesn’t exist, it will be created and given a permission mode based upon the value in the third parameter.[3] The number is typically specified in octal; the frequently used value of 0644 gives read-only permission to everyone but the owner, who gets read/write permission. If you’re trying to open an existing file, you’d probably rather have the dbmopen fail if the file isn’t found, so just use undef as the third parameter.

The return value from the dbmopen is true if the database could be opened or created, and false otherwise, just like open. You should generally use or die in the same spirit as open.

The DBM hash typically stays open throughout the program. When the program terminates, the association is terminated. You can also break the association in a manner similar to closing a filehandle, by using dbmclose:

dbmclose(%DATA);

Using a DBM Hash

Here’s the beauty of the DBM hash: it works just like the hashes you already understand! To read from the file, look at an element of the hash. To write to the file, store something into the hash. In short, it’s like any other hash, but instead of being stored in memory, it’s stored on disk. And thus, when your program opens it up again, the hash is already stuffed full of the data from the previous invocation.

All of the normal hash operations are available:

$DATA{"fred"} = "bedrock";      # create (or update) an element
delete $DATA{"barney"};         # remove an element of the database

foreach my $key (keys %DATA) {  # step through all values
  print "$key has value of $DATA{$key}\n";
}

That last loop could have a problem, since keys has to traverse the entire hash, possibly producing a very large list of keys. If you are scanning through a DBM hash, it’s generally more memory-efficient to use the each function:

while (my($key, $value) = each(%DATA)) {
  print "$key has value of $value\n";
}
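To see the persistence for yourself, here is a minimal sketch that plays the part of two separate runs within a single program. The scratch directory from File::Temp and the second hash name %AGAIN are assumptions made for this illustration; an ordinary program would simply name its database file directly.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $dir = tempdir(CLEANUP => 1);
my $db  = "$dir/my_database";  # hypothetical database name

# First "run": create the database and store two elements.
my %DATA;
dbmopen(%DATA, $db, 0644) or die "Cannot create $db: $!";
$DATA{"fred"}   = "bedrock";
$DATA{"barney"} = "rubble";
dbmclose(%DATA);

# Second "run": a different hash name, same file; the data is already there.
my %AGAIN;
dbmopen(%AGAIN, $db, 0644) or die "Cannot reopen $db: $!";
delete $AGAIN{"barney"};
my $count = 0;
my $town;
while (my($key, $value) = each(%AGAIN)) {
  print "$key has value of $value\n";
  $town = $value if $key eq "fred";
  $count++;
}
dbmclose(%AGAIN);
```

Because the hash name isn't stored in the file, the second dbmopen is free to use a different name than the first.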

If you are accessing DBM files that are maintained by C programs, you should be aware that C programs generally tack on a trailing NUL ("\0") character to the end of their strings, for reasons known only to Kernighan and Ritchie.[4] The DBM library routines do not need this NUL (they handle binary data using a byte count, not a NUL-terminated string), and so the NUL is stored as part of the data.

To cooperate with these programs, you must therefore append a NUL character to the end of your keys and values, and discard the NUL from the end of the returned values to have the data make sense. For example, to look up merlyn in the sendmail aliases database on a Unix system, you might do something like this:

dbmopen(my %ALI, "/etc/aliases", undef) or die "no aliases?";
my $value = $ALI{"merlyn\0"};                # note appended NUL
$value =~ s/\0$//;                           # remove trailing NUL
print "Randal's mail is headed for: $value\n"; # show result

If your DBM files may be concurrently accessed by more than one process (for example if they’re being updated over the Web), you’ll generally need to use an auxiliary lock file. The details of this are beyond the scope of this book; see The Perl Cookbook by Tom Christiansen and Nathan Torkington (O’Reilly & Associates, Inc.).

Manipulating Data with pack and unpack

When storing data into a DBM file (or in one of the other types of databases we’ll see in this chapter), you may need to store more than one item under a single key. And sometimes you’ll need to be able to prepare some information to be sent over a network connection or to a system-level function, or to decode it upon arrival. That’s why Perl has the pack and unpack functions.

The pack function takes a format string and a list of arguments and packs the arguments together to make a string. Here, we can pack three numbers of varying sizes into a seven-byte string using the formats c, s, and l (these might remind some folks of the words “char”, “short”, and “long”). The first number gets packed into one byte, the second into two bytes, and the third into four bytes, which explains why we say this is a seven-byte string:

my $buffer = pack("c s l", 31, 4159, 265359);

When you want the original list of items back, you can use the same format string with the unpack function:

my($char, $short, $long) = unpack("c s l", $buffer);
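As a quick check of the byte counts described above, this small sketch packs the same three numbers, measures the resulting string with length, and unpacks it again. The variable $size is introduced here only for illustration.

```perl
use strict;
use warnings;

my $buffer = pack("c s l", 31, 4159, 265359);
my $size   = length $buffer;               # 1 + 2 + 4 bytes

# The same format string turns the bytes back into the original list.
my($char, $short, $long) = unpack("c s l", $buffer);

print "$size bytes hold $char, $short, and $long\n";
```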

There are many different format letters available; some of these are the same on every machine (so they’re useful for sending data over a network), while others depend upon how your machine likes to work with data (these are useful for interacting with your system’s own data). See the perlfunc manpage for the latest list of format letters, as new ones are being added in every new version of Perl.

Whitespace may be used at will in a format string to improve readability, as we did in the previous example. For most format letters, you can follow the format letter with a number to indicate a number of times; that is, a format of "ccccccc" may be written more compactly as "c7". Instead of a number, you may follow the last format letter with a star (*), which means to use that format as many times as needed to use up the remaining items in the list (in pack) or to use up the rest of the string (in unpack). So a format of "c*" will either unpack a string into a list of small integers, or pack up those small integers to make a string. For some format letters, such as a, the number is not a repeat count; "a20" is a twenty-character ASCII string, padded with NUL characters as needed.
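The repeat count, the star, and the length meaning of the number after a can all be tried in a few lines. This is only a sketch; the sample values are arbitrary ones chosen for the illustration.

```perl
use strict;
use warnings;

my @small = (72, 105, 33);
my $three = pack("c3", @small);     # same as pack("ccc", @small)
my $star  = pack("c*", @small);     # as many c's as there are items
my @back  = unpack("c*", $star);    # as many c's as there are bytes

my $name  = pack("a20", "fred");    # 20 is a length here, not a repeat count
my $nuls  = ($name =~ tr/\0//);     # count the NUL padding bytes

print "packed: $star\n";
print "unpacked: @back\n";
print "a20 gives ", length($name), " bytes, $nuls of them NUL\n";
```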

Fixed-length Random-access Databases

Another form of persistent data is the fixed-length, record-oriented disk file.[5] In this scheme, the data consists of a number of records of identical length. The numbering of the records is either not important or determined by some indexing scheme.

For example, we might want to store some information about each bowler at Bedrock Lanes. Let’s say we decide to have a series of records, one per bowler, in which the data holds the player’s name, age, last five bowling scores, and the time and date of his last game.

We need to decide upon a suitable format for this data. Let’s say that after studying the available formats in the documentation for pack, we decide to use 40 characters for the player’s name, a one-byte integer for his age,[6] five two-byte integers for his last five scores,[7] and a four-byte integer for the timestamp of his most recent game,[8] giving a format string of "a40 C S5 L". Each record is thus 55 bytes long. If we were reading all of the data in the database, we’d read chunks of 55 bytes until we got to the end. If we wanted to go to the fifth record, we’d skip ahead 4 x 55 bytes (220 bytes) and read the fifth record directly.

Perl supports programs that use such a disk file. In order to do so, however, you need to learn a few more things, including how to:

  1. Open a disk file for both reading and writing

  2. Move around in this file to an arbitrary position

  3. Fetch data by a length rather than up to the next newline

  4. Write data down in fixed-length blocks

The open function has an additional mode we haven’t shown yet. If you use "+<" at the front of the filename parameter’s string, that is similar to using "<" to open the existing file for reading, except that it also asks for write permission on the file. Thus you can have read/write access to the file:

open(FRED, "<fred");  # open file fred for reading (error if file absent)
open(FRED, "+<fred"); # open file fred read/write (error if file absent)

Similarly, "+>" says to create a new file (as ">" would), but to have read access to it as well, thus also giving read/write access:

open(WILMA, ">wilma");  # make new file wilma (wiping out existing file)
open(WILMA, "+>wilma"); # make new file wilma, but also with read access

Do you see the important difference between the two new modes? Both give read/write access to a file. But "+<" lets you work with an existing file; it doesn’t create it. The second mode, "+>", isn’t often useful, because it gives read/write access to a new, empty file that it has just created. That’s mostly used for temporary (scratch) files.
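The difference between the two modes can be seen in a short experiment. This sketch assumes a scratch directory from File::Temp so that it can't disturb any real files; the file contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $dir = tempdir(CLEANUP => 1);

# "+<" will not create a file that is absent, so this open fails.
my $opened_missing = open(FRED, "+<$dir/fred") ? 1 : 0;
close(FRED) if $opened_missing;

# "+>" creates wilma (or wipes out an existing one).
open(WILMA, "+>$dir/wilma") or die "Cannot create wilma: $!";
print WILMA "pebbles\n";
close(WILMA);

# Now that wilma exists, "+<" opens it and leaves the contents alone.
open(WILMA, "+<$dir/wilma") or die "Cannot open wilma: $!";
my $kept = <WILMA>;
close(WILMA);

# Opening with "+>" again throws the old contents away.
open(WILMA, "+>$dir/wilma") or die "Cannot re-create wilma: $!";
close(WILMA);
my $wiped = (-z "$dir/wilma") ? 1 : 0;   # -z is true for a zero-length file

print "missing file opened: $opened_missing; kept: $kept";
print "wiped by +>: $wiped\n";
```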

Once we’ve got the file open, we need to move around in it. You do this with the seek function:

seek(FRED, 55 * $n, 0);  # seek to start of record $n

The first parameter to seek is a filehandle, the second parameter gives the offset in bytes from the start of the file, and the third parameter is zero.[9] To get to a certain record in our file of bowling data, you’ll need to skip over some other records. Since each record is 55 bytes long, we’ll multiply $n by 55 to find out which byte position we want. (Note that the record numbers are thus zero-based; record zero is at the beginning of the file.)

Once the file pointer has been positioned with seek, the next input or output operation will start at that position.

When we’re ready to read from the file, we can’t use the ordinary line-input operator because that’s made to read lines, not 55-byte records. There may not be a newline character in this entire file, or it may appear in packed data in the middle of a record. Instead, we’ll use the read function:

my $buf;  # The input buffer variable
my $number_read = read(FRED, $buf, 55);

As you can see, the first parameter to read is the filehandle. The second parameter is a buffer variable; the data read will be placed into this variable. (Yes, this is an odd way to get the result.) The third parameter is the number of bytes to read; here we’ve asked for 55 bytes, since that’s the size of our record. Normally, you can expect the length of $buf to be the specified number of bytes, and you can expect the return value (in $number_read) to be the same. But if your current position in the file is only five bytes from the end when you request 55 bytes, you’ll get only five. Under normal circumstances, you’ll get as many bytes as you ask for.
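Here is a small sketch of seek and read working together, including the short read near the end of the file. The scratch file, its three made-up records (each just a name padded to 55 bytes), and the calls to binmode are assumptions for the illustration; binmode matters only on systems that translate newlines.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/records";     # hypothetical data file

# Build a scratch file of three 55-byte records.
open(FRED, ">$file") or die "Cannot create $file: $!";
binmode(FRED);
foreach my $who (qw(fred barney wilma)) {
  print FRED pack("a55", $who);
}
close(FRED);

open(FRED, "+<$file") or die "Cannot open $file: $!";
binmode(FRED);

# Jump straight to record 2 (zero-based) and read it.
my $n = 2;
seek(FRED, 55 * $n, 0) or die "Cannot seek: $!";
my $buf;
my $number_read = read(FRED, $buf, 55);
(my $name = $buf) =~ s/\0+$//;            # strip the NUL padding

# Five bytes from the end, a request for 55 bytes yields only five.
seek(FRED, 55 * 3 - 5, 0) or die "Cannot seek: $!";
my $tail;
my $short_read = read(FRED, $tail, 55);
close(FRED);

print "record $n: $name ($number_read bytes); short read: $short_read bytes\n";
```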

Once you’ve got those 55 bytes, what can you do with them? You can unpack them (using the format we previously designed) to get the bowler’s name and other information, of course:

my($name, $age, $score_1, $score_2, $score_3, $score_4, $score_5, $when)
  = unpack "a40 C S5 L", $buf;

Since we can read the information from the file with read, can you guess how we can write it back into the file? Sorry, it’s not write; that was a trick question.[10] You already know the correct function, which is print. But you have to be sure that the data string is exactly the right size; if it’s too large, you’ll overwrite the next record’s data, but if it’s too small, leftover data in the current record may be mixed with the new data. To ensure that the length is correct, we’ll use pack. Let’s say that Wilma has just bowled a game and her new score is in $new_score. That will be the first of the five most-recent scores we keep for her ($score_5, as the oldest one, will be discarded), and in place of $when (the timestamp of her previous game), we’ll store the current time from the time function:

print FRED pack("a40 C S5 L",
  $name, $age,
  $new_score, $score_1, $score_2, $score_3, $score_4,
  time);

On some systems, you’ll have to use seek whenever you switch from reading to writing, even if the current position in the file is already correct. It’s not a bad idea, then, to always use seek right before reading or printing.
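Putting the pieces together, a complete update of one record might look like the following sketch. It uses S for the five scores, since S is always two bytes wide and gives the 55-byte record described earlier. The filehandle BOWL, the scratch file, and the sample bowlers are all hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $format = "a40 C S5 L";     # S is always two bytes: 40 + 1 + 10 + 4 = 55
my $length = 55;

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/bowlers";     # hypothetical data file

# Build a scratch file holding two records.
open(BOWL, ">$file") or die "Cannot create $file: $!";
binmode(BOWL);
print BOWL pack($format, "Fred",  30, 200, 190, 180, 170, 160, 1000);
print BOWL pack($format, "Wilma", 28, 250, 240, 230, 220, 210, 2000);
close(BOWL);

my $n         = 1;             # Wilma's record (zero-based)
my $new_score = 300;

open(BOWL, "+<$file") or die "Cannot open $file: $!";
binmode(BOWL);

# Read the old record.
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
read(BOWL, my $buf, $length) == $length or die "Short read";
my($name, $age, $score_1, $score_2, $score_3, $score_4, $score_5, $when)
  = unpack($format, $buf);

# Go back to the start of the same record before switching to writing.
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
print BOWL pack($format,
  $name, $age,
  $new_score, $score_1, $score_2, $score_3, $score_4,
  time);

# Read the record once more to check the result.
seek(BOWL, $length * $n, 0) or die "Cannot seek: $!";
read(BOWL, $buf, $length) == $length or die "Short read";
my @check = unpack($format, $buf);
close(BOWL);

my $file_size = -s $file;
print "scores now: @check[2..6]\n";
```

Since the new record is packed with the same format as the old one, the file is still exactly two records long afterward.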

Rather than use the two constant values "a40 C S5 L" and 55 throughout the program, as we’ve done here, it would generally be better to define them just once near the top of the code. That way, if we ever need to change the database format, we don’t have to go searching through our code for places where the number 55 appears. Here’s one way you might define both of those values, using the length function to determine the length of a string so you won’t have to count bytes:

my $pack_format = "a40 C S5 L";
my $pack_length = length pack($pack_format, "dummy data",
  0, 1, 2, 3, 4, 5, 6);
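With the length computed once, byte offsets can be derived from it instead of from a hard-coded 55. In this sketch, record_offset is a hypothetical helper, and the scores use S because it is always two bytes wide.

```perl
use strict;
use warnings;

my $pack_format = "a40 C S5 L";
my $pack_length = length pack($pack_format, "dummy data",
  0, 1, 2, 3, 4, 5, 6);

# Hypothetical helper: byte offset of zero-based record $n.
sub record_offset {
  my $n = shift;
  return $pack_length * $n;
}

my $fifth = record_offset(4);   # the fifth record is record number 4
print "each record is $pack_length bytes; the fifth starts at byte $fifth\n";
```

A call such as seek(FRED, record_offset($n), 0) would then keep working even if the format string changes later.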

Variable-length (Text) Databases

Many simple databases are merely text files written in a format that allows a program to read and maintain them. For example, a configuration file for some program might be a text file, with one configuration parameter being set on each line. Or maybe the file is a mailing list, with one name and address on each line (probably with the components of the name and address separated by tab characters).

Updating text files is more difficult than it probably seems at first. But that’s only because we’re used to seeing text files rendered as pages (or screens) of text. If you could see the file as it is written in the filesystem, the difficulty is more apparent. Since we can’t show you the file as it’s actually written without opening up a disk drive, here’s our rendition of a piece of a text file[11]:

He had bought a large map representing the sea,\n  Without the l
east vestige of land:\nAnd the crew were much pleased when they 
found it to be\n  A map they could all understand.\n\n"What's th
e good of Mercator's North Poles and Equators,\n  Tropics, Zones
, and Meridian Lines?"\nSo the Bellman would cry: and the crew w
ould reply\n  "They are merely conventional signs!\n\n"Other map
s are such shapes, with their islands and capes!\n  But we've go
t our brave Captain to thank:"\n(So the crew would protest) "tha
t he's bought us the best-\n  A perfect and absolute blank!"\n

If you had this file open in your text editor, it would be easy to change a word, add a comma, or fix a misspelling. If your editor is powerful enough, in fact, you could change the indentation of each line with a single command. But the text file is a stream of bytes; if you wanted to add even a single comma, the remainder of the text file (possibly thousands or millions of bytes) would have to move over to make room. Nearly every tiny change would mean lots of slow copying operations on the file. So how can we edit the file efficiently?

The most common way of programmatically updating a text file is by writing an entirely new file that looks similar to the old one, but making whatever changes we need as we go along. As you’ll see, this technique gives nearly the same result as updating the file itself, but it has some beneficial side effects as well.

In this example, we’ve got hundreds of files with a similar format. One of them is fred03.dat, and it’s full of lines like these:

Program name: granite
Author: Gilbert Bates
Company: RockSoft
Department: R&D
Phone: +1 503 555-0095
Date: Tues March 9, 1999
Version: 2.1
Size: 21k
Status: Final beta

We need to fix this file so that it has some different information. Here’s roughly what this one should look like when we’re done:

Program name: granite
Author: Randal L. Schwartz
Company: RockSoft
Department: R&D
Date: June 12, 2002 6:38 pm
Version: 2.1
Size: 21k
Status: Final beta

In short, we need to make three changes. The name of the Author should be changed; the Date should be updated to today’s date, and the Phone should be removed completely. And we have to make these changes in hundreds of similar files as well.

Perl supports a way of in-place editing of files with a little extra help from the diamond operator ("<>"). Here’s a program to do what we want, although it may not be obvious how it works at first. This program’s only new feature is the special variable $^I; ignore that for now, and we’ll come back to it:

#!/usr/bin/perl -w

use strict;

chomp(my $date = `date`);
@ARGV = glob "fred*.dat" or die "no files found";
$^I = ".bak";

while (<>) {
  s/^Author:.*/Author: Randal L. Schwartz/;
  s/^Phone:.*\n//;
  s/^Date:.*/Date: $date/;
  print;
}

Since we need today’s date, the program starts by using the system date command. A better way to get the date (in a slightly different format) would almost surely be to use Perl’s own localtime function in a scalar context:

my $date = localtime;
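As a brief sketch of how that string could be used, here it replaces the date on one sample line. The sample line is taken from the data file shown earlier; the exact date printed will of course depend on when you run it.

```perl
use strict;
use warnings;

my $date = localtime;     # scalar context: one string, with no trailing newline
my $line = "Date: Tues March 9, 1999\n";

# The dot doesn't match the newline, so the line ending is left in place.
$line =~ s/^Date:.*/Date: $date/;
print $line;
```

Unlike the output of the date command, the string from localtime needs no chomp.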

To get the list of files for the diamond operator, we read them from a glob. The next line sets $^I, but keep ignoring that for the moment.

The main loop reads, updates, and prints one line at a time. (With what you know so far, that means that all of the files’ newly modified contents will be dumped to your terminal, scrolling furiously past your eyes, without the files being changed at all. But stick with us.) Note that the second substitution can replace the entire line containing the phone number with an empty string—leaving not even a newline—so when that’s printed, nothing comes out, and it’s as if the Phone never existed. Most input lines won’t match any of the three patterns, and those will be unchanged in the output.

So this result is close to what we want, except that we haven’t shown you how the updated information gets back out onto the disk. The answer is in the variable $^I. By default it’s undef, and everything is normal. But when it’s set to some string, it makes the diamond operator ("<>") even more magical than usual.

We already know about much of the diamond’s magic—it will automatically open and close a series of files for you, or read from the standard-input stream if there aren’t any filenames given. But when there’s a string in $^I, that string is used as a backup filename’s extension. Let’s see that in action.

Let’s say it’s time for the diamond to open our file fred03.dat. It opens it like before, but now it renames it, calling it fred03.dat.bak.[12] We’ve still got the same file open, but now it has a different name on the disk. Next, the diamond creates a new file and gives it the name fred03.dat. That’s okay; we weren’t using that name any more. And now the diamond selects the new file as the default for output, so that anything that we print will go into that file.[13]

So now the while loop will read a line from the old file, update that, and print it out to the new file. This program can update hundreds of files in a few seconds on a typical machine. Pretty powerful, huh?

Once the program has finished, what does the user see? The user says, “Ah, I see what happened! Perl edited my file fred03.dat, making the changes I needed, and saved me a copy of the original in the backup file fred03.dat.bak just to be helpful!” But we now know the truth: Perl didn’t really edit any file. It made a modified copy, said “Abracadabra!”, and switched the files around while we were watching sparks come out of the magic wand. Tricky.
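To take some of the mystery out of the trick, here is roughly the same sequence of steps written out by hand for a single file. It is only a sketch of the idea, not a description of Perl's internals: it renames first and opens afterward (a simplification that should also work on systems that can't rename an open file), and the scratch directory and sample data are assumptions for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fred03.dat";

# Sample data standing in for the real file.
open(OUT, ">$file") or die "Cannot create $file: $!";
print OUT "Author: Gilbert Bates\nPhone: +1 503 555-0095\nVersion: 2.1\n";
close(OUT);

# Step 1: the original gets the backup name.
rename($file, "$file.bak") or die "Cannot rename $file: $!";

# Step 2: read from the backup, write a new file under the old name.
open(OLD, "<$file.bak") or die "Cannot open $file.bak: $!";
open(NEW, ">$file")     or die "Cannot create $file: $!";
while (<OLD>) {
  s/^Author:.*/Author: Randal L. Schwartz/;
  s/^Phone:.*\n//;
  print NEW $_;
}
close(OLD);
close(NEW) or die "Cannot finish writing $file: $!";

# Look at the outcome.
open(CHECK, "<$file") or die "Cannot reopen $file: $!";
my $result = join "", <CHECK>;
close(CHECK);
print $result;
```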

Some folks use a tilde ("~") as the value for $^I, since that resembles what emacs does for backup files. Another possible value for $^I is the empty string. This enables in-place editing, but doesn’t save the original data in a backup file. But since a small typo in your pattern could wipe out all of the old data, using the empty string is recommended only if you want to find out how good your backup tapes are. It’s easy enough to delete the backup files when you’re done. And when something goes wrong and you need to rename the backup files to their original names, you’ll be glad that you know how to use Perl to do that (see the multiple-file rename example in Chapter 13).
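Putting the backups back might look something like this sketch, which is not the Chapter 13 example itself but a guess at the same idea. The scratch directory and the damaged sample files are assumptions; on some non-Unix systems, rename may refuse to replace an existing file, in which case the damaged file would have to be removed first.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # assumption: used only to get a scratch directory

my $dir = tempdir(CLEANUP => 1);

# Set up two damaged files, each with a good backup beside it.
foreach my $name ("fred01.dat", "fred02.dat") {
  open(OUT, ">$dir/$name") or die "Cannot create $name: $!";
  print OUT "oops\n";
  close(OUT);
  open(OUT, ">$dir/$name.bak") or die "Cannot create $name.bak: $!";
  print OUT "Author: Gilbert Bates\n";
  close(OUT);
}

# Rename each backup to the name it had before the edit.
my $restored = 0;
foreach my $backup (glob "$dir/fred*.dat.bak") {
  (my $original = $backup) =~ s/\.bak$//;
  if (rename $backup, $original) {
    $restored++;
  } else {
    warn "Cannot restore $original: $!\n";
  }
}

open(CHECK, "<$dir/fred01.dat") or die "Cannot open fred01.dat: $!";
my $first_line = <CHECK>;
close(CHECK);
print "restored $restored files; fred01.dat begins: $first_line";
```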

In-place Editing from the Command Line

A program like the example from the previous section is fairly easy to write. But Larry decided it wasn’t easy enough.

Imagine that you need to update hundreds of files that have the misspelling Randall instead of the one-l name Randal. You could write a program like the one in the previous section. Or you could do it all with a one-line program, right on the command line:

$ perl -p -i.bak -w -e 's/Randall/Randal/g' fred*.dat

Perl has a whole slew of command-line options that can be used to build a complete program in a few keystrokes.[14] Let’s see what these few do.

Starting the command with perl does something like putting #!/usr/bin/perl at the top of a file does: it says to use the program perl to process what follows.

The -p option tells Perl to write a program for you. It’s not much of a program, though; it looks something like this:[15]

while (<>) { print; }

If you want even less, you could use -n instead; that leaves out the print statement. (Fans of awk will recognize -p and -n.) Again, it’s not much of a program, but it’s pretty good for the price of a few keystrokes.

The next option is -i.bak, which you might have guessed sets $^I to ".bak" before the program starts. If you don’t want a backup file, you can use -i alone, with no extension.

We’ve seen -w before—it turns on warnings.

The -e option says “executable code follows.” That means that the s/Randall/Randal/g string is treated as Perl code. Since we’ve already got a while loop (from the -p option), this code is put inside the loop, before the print. For technical reasons, the last semicolon in the -e code is optional. But if you have more than one -e, and thus more than one chunk of code, only the semicolon at the end of the last one may safely be omitted.

The last command-line parameter is fred*.dat, which says that @ARGV should hold the list of filenames that match that glob. Put the pieces all together, and it’s as if we had written a program like this:

#!/usr/bin/perl -w

@ARGV = glob "fred*.dat";
$^I = ".bak";

while (<>) {
  s/Randall/Randal/g;
  print;
}

Compare this program to the one we used in the previous section. It’s pretty similar. These command-line options are pretty handy, aren’t they?

Exercises

These exercises are all related; it may be helpful to see what the second and third should do before starting on the first. See Section A.15 for answers.

  1. [15] Make a program that will read through the perlfunc.pod file looking for identifier names on =item lines (as in the similar exercise at the end of Chapter 9). The program should write a database showing the first line number on which each identifier appears. That is, if fred was mentioned on lines 23, 29, and 54, the value stored under the key fred would be 23. (Hint: the special $. variable gives the line number of the line that was just read.)

  2. [10] Make a program that will take a Perl function name on the command line, and report what =item line of the perlfunc.pod file first mentions that function. Your program should not have to read through a long file to get this answer. What should your program do if the function name isn’t found?

  3. [10] (Extra credit exercise.) Modify the program from the previous exercise so that when the function is found in the database, your program will launch your favorite pager program to view the perlfunc.pod file at that line. (Hint: many programs that can be used for viewing text files work like less does, with a command line like less +1234 filename to start viewing the file at line 1234. Your favorite text editor may also support this convention, which is also used by more, pico, vi, emacs, and view.)



[1] To be sure, on some of these, the core of Perl will load a module for you. But you don’t need to know anything about modules to use these databases.

[2] Here we depart from other beginner documentation, which claims that dbmopen is deprecated and suggests that you use the more complicated tie interface instead. We disagree, since dbmopen works just fine, and it keeps you from having to think harder about what you’re doing. Keep the common tasks simple!

[3] The actual mode will be modified by the umask; see the perlfunc manpage for more information.

[4] Well, they’re not the only ones: it’s because C uses the NUL byte as the end-of-string marker.

[5] By “fixed-length,” we don’t mean that the file itself is of a fixed length; it’s each individual record that is of a fixed length. In this section, we’ll use an example file in which every record is 55 bytes long.

[6] Since one byte may have 256 different values, this will hold ages from 0 to 255 with ease. If Methuselah comes to bowl in Bedrock, we’ll have to redesign the database.

[7] We can’t use one-byte integers for the scores, because a bowling score can be as high as 300. Two-byte integers can hold values from 0 to 65535 (if unsigned) or -32768 to 32767 (if signed). We can use some of these extra values as special codes; for example, if a player has only three games on record, the other scores could be set to 9999 to indicate this.

[8] The standard Unix timestamp format (and the time value used by many other systems) is a 32-bit integer, which fits into four bytes, of course. You’ll probably find it handy to use a module to manipulate time and date formats.

[9] Actually, the third parameter is the “whence” parameter. You can use a different value than zero if you want to seek to a position relative to the current position, or relative to the end of the file; see the perlfunc manpage for more information. Most people will simply want to use zero here.

[10] Perl actually does have a write function, but that is used with formats, which are beyond the scope of this book. See the perlform manpage.

[11] Of course, the real file wouldn’t have lines at all; it’s one long stream of text. And the newline character should really be a single-character code. But these differences don’t hurt this as an example.

[12] Some of the details of this procedure will vary on non-Unix systems, but the end result should be nearly the same. See the release notes for your port of Perl.

[13] The diamond also tries to duplicate the original file’s permission and ownership settings as much as possible; for example, if the old one was world-readable, the new one should be, as well.

[14] See the perlrun manpage for the complete list.

[15] Actually, the print occurs in a continue block. See the perlsyn and perlrun manpages for more information.
