CHAPTER 13

Files and Directories

There are plenty of file-handling tasks that do not require opening a filehandle. Examples include copying, moving, or renaming files, and interrogating files for their size, permissions, or ownership. In this chapter, we cover the manipulation of files and directories in ways other than opening filehandles to read or write them using Perl's built-in functions. We also look at the various modules provided as standard with Perl to make file handling both simpler and portable across different platforms, at finding files with wildcards through file name globbing, and at creating temporary files.

While the bulk of this chapter is concerned with files, we also spend some time looking at directories, which, while fundamentally different entities from files, turn out to be manipulated by broadly similar techniques.

Querying and Manipulating Files

Files are located in a file system, which stores attributes for every file it references. The most obvious attribute a file has is a file name, but we can also test filing system entries for properties such as their type (file, directory, link) and access permissions with Perl's file test operators. To retrieve detailed information about a file's attributes, we also have the stat and lstat functions at our disposal. Perl also provides built-in functions to manipulate file permissions, ownership, and create or destroy file names within the file system. However, the built-in functions are limited in how versatile they can be on different platforms, because the concepts they embody originate with Unix in mind and are not always portable in themselves.

In addition to the built-in functions, Perl provides a toolkit of modules for handling files portably, regardless of the underlying platform. Some wrap built-in functions, like File::stat, or aggregate many file test operations into a single function call, like File::CheckTree. Others provide useful features such as finding or comparing files, like File::Find and File::Compare. Most of these modules are built on top of File::Spec, which provides basic support for cross-platform file name handling. It is almost always a good idea to use these modules in place of a built-in function whenever portability is a concern.

Beyond basic file handling, Perl also provides the glob function for retrieving file names through wildcard specifications. The built-in glob function is actually implemented in terms of a family of standard modules—each handling a different platform—that we can also use directly for greater control.

A final but important aspect of file handling is the creation and use of temporary files. Though apparently simple on the surface, there are several ways to create a temporary file, each with its own advantages, drawbacks, and portability issues.

Getting User and Group Information

Perl provides built-in support for handling user and group information on Unix platforms through the getpwent and getgrent families of functions. This support is principally derived from the underlying C library functions of the same names, which are in turn dependent on the details of the implementation provided by the operating system. All Unix platforms provide broadly the same features for user and group management, but they vary slightly in what additional information they store. While Perl makes a reasonable attempt to unify all the variations, the system documentation is the best source of information on what values these functions return.

Unix platforms define user and group information in the /etc/passwd and /etc/group files, but this oversimplifies the actual process of looking up user and group information for two reasons. First, if a shadow password file is in use, then the user information in /etc/passwd will not contain an encrypted password in the password field. Second, if alternative sources of user and group information are configured (such as NIS or NIS+), then requesting user or group information may initiate a network lookup to retrieve information from a remote server. The order in which local and remote information sources are consulted is typically defined by the file /etc/nsswitch.conf.

Support for other security models and platforms is not provided through built-in functions, but through extension modules. Windows programmers, for example, can make use of the Win32::AdminMisc module to gain access to the Win32 Security API. Windows and other non-Unix platforms do not support getpwent or getgrent, though the Cygwin environment does provide a veneer of Unix security that allows these functions to work on Windows platforms with limited functionality, enough for Perl programs that use them to function. Access Control Lists (ACLs) and other advanced security features are beyond the reach of the built-in functions even on Unix platforms, but they can be handled via various modules available from CPAN.

User Information

Unix platforms store local user information in the /etc/passwd file (though as noted previously they may also retrieve information remotely). The format varies slightly but typically has a structure like this:

fred:RGdmsaynFgP56:301:200:Fred A:/home/fred:/bin/bash
jim:Edkl1y7NMtO/M:302:200:Jim B:/home/jim:/bin/ksh
mysql:!!:120:120:MySQL server:/var/lib/mysql:/bin/csh

Each line contains the following fields: name, password, user ID, primary group ID, comment/GECOS, home directory, and login shell. In this case, we are not using a shadow password file, so the password field contains an encrypted password. The first two lines are for regular users, while the third defines an identity for a MySQL database server to run as. It does not need a password since it is not intended as a login user, so the password is disabled with !! (* is often also used for this purpose).

The getpwent function (pwent is short for "password entry") retrieves one entry from the user information file at a time, starting from the first. In list context, it returns no less than ten fields:

($name, $passwd, $uid, $gid, $quota, $comment, $gcos, $dir, $shell, $expire)
    = getpwent;

Since the format and source of user information varies, not all these fields are always defined, and some of them have alternate meanings. A summary of each field and its possible meanings is given in Table 13-1; consult the manual page for the passwd file (typically via man 5 passwd) for exact details of what fields are provided on a given platform.

Table 13-1. getpwent Fields

Field Name   Number   Meaning
name 0 The login name of the user.
passwd 1 The encrypted password. Depending on the platform, the password may be encrypted using the standard Unix crypt function or the more secure MD5 hashing algorithm. If a shadow password file is in use, this field contains an asterisk instead. Additionally, disabled accounts often have their password prefixed with ! to disable it.
uid 2 The user ID of this user.
gid 3 The primary group of this user. Other groups can be found using the group functions detailed later.
quota 4 The disk space quota allotted to this user. Frequently unsupported. On some systems this may be a change or age field instead.
comment 5 A comment, usually the user's full name. On some systems this may be a class field instead. The comment field is often called the gcos field, but this is not technically accurate; this or the next item may therefore actually contain the comment.
gcos 6 Also known as GECOS, originally standing for "General Electric Computer Operating System," although the original meaning is now of historical interest. An extended comment containing a comma-separated series of values—for example, the user's name, location, and work/home phone numbers. Frequently unimplemented, but see note on comment.
dir 7 The home directory of the user, for example, /home/name.
shell 8 The preferred login shell of the user, for example, /usr/bin/bash.
expire 9 The expiry date of the user account. Frequently unsupported, often undefined.

In scalar context, getpwent returns just the name of the user, that is, the first field. To illustrate, we can generate a list of user names with a program like the following:

#!/usr/bin/perl
# listusers.pl
use warnings;
use strict;

my @users;
while (my $name = getpwent) {
    push @users, $name;
}
print "Users: @users ";

Supporting getpwent are the setpwent and endpwent functions. The setpwent function resets the pointer for the next record returned by getpwent to the start of the password file. It is analogous to the rewinddir function in the same way that getpwent is analogous to both opendir and readdir combined. Since there is only one password file, it takes no arguments:

setpwent;

The endpwent function is analogous to closedir: it closes the internal file pointer created whenever we use getpwent (or getpwnam/getpwuid, detailed in the upcoming text). We cannot get access to this internal filehandle, but it may be freed in order to recapture consumed resources. Additionally, if a network query was made, then this will close the connection:

endpwent;

The getpwnam and getpwuid functions look up user names and user IDs from each other. getpwnam takes a user name as an argument and returns the user ID in scalar context or the full list of ten in a list context:

$uid = getpwnam($username);
@fields = getpwnam($username);

Similarly, getpwuid takes a numeric user ID and returns either the name or a list of fields, depending on context:

$username = getpwuid($uid);
@fields = getpwuid($uid);

Both functions also have the same effect as setpwent in that they reset the position of the pointer used by getpwent, so they cannot be combined with it in loops.
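To see the pitfall in action, consider the following deliberately flawed sketch (the guard counter and its limit of 100 are purely illustrative). On platforms where the lookup functions share the iteration pointer, removing the guard would leave the loop revisiting the first entry forever:

#!/usr/bin/perl
# pwreset.pl - a cautionary sketch of the iterator reset
use warnings;
use strict;

my $count = 0;
while (my $name = getpwent) {
    getpwnam($name);          # resets the pointer used by getpwent!
    last if ++$count > 100;   # guard against the resulting endless loop
}
endpwent;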

Since ten fields is rather a lot to manage, Perl supplies the User::pwent module to provide an object-oriented interface to the pw functions. It is one of several modules that all behave similarly; others are User::grent (for group information), Net::hostent, Net::servent, Net::netent, Net::protoent (for network information), and File::stat (for the stat and lstat functions).

User::pwent works by overloading the built-in getpwent, getpwnam, and getpwuid functions with object-oriented methods returning a pw object, complete with methods to extract the relevant fields. It also has the advantage of knowing what methods actually apply, which we can determine using the pw_has class method. Here is an object-oriented user information listing program, which uses getpwent to illustrate how the User::pwent module is used:

#!/usr/bin/perl
# listobjpw.pl
use warnings;
use strict;

use User::pwent qw(:DEFAULT pw_has);

print "Supported fields: ", scalar(pw_has), " ";

while (my $user = getpwent) {
    print 'Name    : ', $user->name, "\n";
    print 'Password: ', $user->passwd, "\n";
    print 'User ID : ', $user->uid, "\n";
    print 'Group ID: ', $user->gid, "\n";

    # one of quota, change or age
    print 'Quota   : ', $user->quota, "\n" if pw_has('quota');
    print 'Change  : ', $user->change, "\n" if pw_has('change');
    print 'Age     : ', $user->age, "\n" if pw_has('age');

    # one of comment or class (also possibly gcos is comment)
    print 'Comment : ', $user->comment, "\n" if pw_has('comment');
    print 'Class   : ', $user->class, "\n" if pw_has('class');

    print 'Home Dir: ', $user->dir, "\n";
    print 'Shell   : ', $user->shell, "\n";

    # maybe gecos, maybe not
    print 'GECOS   : ', $user->gecos, "\n" if pw_has('gecos');

    # maybe expires, maybe not
    print 'Expire  : ', $user->expire, "\n" if pw_has('expire');

    # separate records
    print "\n";
}

If called with no arguments, the pw_has class method returns a list of supported fields in list context, or a space-separated string suitable for printing in scalar context. Because we generally want to use it without prefixing User::pwent::, we specify it in the import list. However, to retain the default imports that override getpwent and the like, we also need to specify the special :DEFAULT tag.

We can also import scalar variables for each field and avoid the method calls by adding the :FIELDS tag (which also implies :DEFAULT) to the import list. This generates a set of scalar variables with the same names as their method equivalents but prefixed with pw_. The equivalent of the preceding object-oriented script written using field variables is

#!/usr/bin/perl
# listfldpw.pl
use warnings;
use strict;

use User::pwent qw(:FIELDS pw_has);

print "Supported fields: ", scalar(pw_has), " ";

while (my $user = getpwent) {
    print 'Name    : ', $pw_name, " ";
    print 'Password: ', $pw_passwd, " ";
    print 'User ID : ', $pw_uid, " ";
    print 'Group ID: ', $pw_gid, " ";

    # one of quota, change or age
    print 'Quota   : ', $pw_quota, " " if pw_has('quota'),
    print 'Change  : ', $pw_change, " " if pw_has('change'),
    print 'Age     : ', $pw_age, " " if pw_has('age'),

    # one of comment or class (also possibly gcos is comment)
    print 'Comment : ', $pw_comment, " " if pw_has('comment'),
    print 'Class   : ', $pw_class, " " if pw_has('class'),

    print 'Home Dir: ', $pw_dir, " ";
    print 'Shell   : ', $pw_shell, " ";

    # maybe gcos, maybe not
    print 'GCOS    : ', $pw_gecos, " " if pw_has('gecos'),
    # maybe expires, maybe not
    print 'Expire  : ', $pw_expire, " " if pw_has('expire'),

    # separate records
    print " ";
}

We may selectively import variables if we want to use a subset, but since this overrides the default import, we must also explicitly import the functions we want to override:

use User::pwent qw($pw_name $pw_uid $pw_gid getpwnam);

To call the original getpwent, getpwnam, and getpwuid functions, we can use the CORE:: prefix. Alternatively, we could suppress the overrides by passing an empty import list or a list containing neither :DEFAULT nor :FIELDS. As an example, here is another version of the preceding script that invents a new object method, has, for the User::pwent package, then uses that and fully qualified calls only, avoiding all imports:

#!/usr/bin/perl
# listcorpw.pl
use warnings;
use strict;

use User::pwent();

sub User::pwent::has {
    my $self = shift;
    return User::pwent::pw_has(@_);
}

print "Supported fields: ", scalar(User::pwent::has), " ";

while (my $user = User::pwent::getpwent) {
    print 'Name    : ', $user->name, " ";
    print 'Password: ', $user->passwd, " ";
    print 'User ID : ', $user->uid, " ";
    print 'Group ID: ', $user->gid, " ";

    # one of quota, change or age
    print 'Quota   : ', $user->quota, " " if $user->has('quota'),
    print 'Change  : ', $user->change, " " if $user->has('change'),
    print 'Age     : ', $user->age, " " if $user->has('age'),

    # one of comment or class (also possibly gcos is comment)
    print 'Comment : ', $user->comment, " " if $user->has('comment'),
    print 'Class   : ', $user->class, " " if $user->has('class'),

    print 'Home Dir: ', $user->dir, " ";
    print 'Shell   : ', $user->shell, " ";

    # maybe gcos, maybe not
    print 'GECOS   : ', $user->gecos, " " if $user->has('gecos'),

    # maybe expires, maybe not
    print 'Expire  : ', $user->expire, " " if $user->has('expire'),
    # separate records
    print " ";
}

As a convenience, the User::pwent module also provides the getpw subroutine, which takes either a user name or a user ID, returning a user object either way:

$user = getpw($user_name_or_id);

If the passed argument looks numeric, then getpwuid is called underneath to do the work; otherwise, getpwnam is called.
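The effect is roughly equivalent to the following sketch, in which my_getpw is an invented name standing in for the real subroutine:

# a sketch of how getpw dispatches on its argument
sub my_getpw {
    my $arg = shift;
    return $arg =~ /^\d+$/ ? getpwuid($arg) : getpwnam($arg);
}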

Group Information

Unix groups are a second tier of privileges between the user's own privileges and those of all users on the system. All users belong to one primary group, and files they create are assigned to this group. This information is recorded locally in the /etc/passwd file and can be found locally or remotely with the getpwent, getpwnam, and getpwuid functions as described previously. In addition, users may belong to any number of secondary groups. This information, along with the group IDs (or gids) and group names, is stored locally in the /etc/group file and can be extracted locally or remotely with the getgrent, getgrnam, and getgrgid functions.

The getgrent function reads one entry from the group file each time it is called, starting with the first and returning the next entry in turn on each subsequent call. It returns four fields: the group name, a password (which is usually not defined), the group ID, and a space-separated list of the users who belong to that group:

#!/usr/bin/perl
# listgr.pl
use warnings;
use strict;

while (my ($name, $passwd, $gid, $members) = getgrent) {
    print "$gid: $name [$passwd] $members ";
}

Alternatively, calling getgrent in a scalar context returns just the group name:

#!/usr/bin/perl
# listgroups.pl
use warnings;
use strict;

my @groups;
while (my $name = getgrent) {
    push @groups, $name;
}
print "Groups: @groups ";

As with getpwent, using getgrent causes Perl (or more accurately, the underlying C library) to open a filehandle (or open a connection to an NIS or NIS+ server) internally. Mirroring the supporting functions of getpwent, setgrent resets the pointer of the group filehandle to the start, and endgrent closes the file (and/or network connection) and frees the associated resources.
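Putting these functions together, here is a minimal sketch that lists the secondary groups a given user belongs to, assuming a Unix platform with the usual group file semantics:

#!/usr/bin/perl
# secondary.pl - list the secondary groups of a user
use warnings;
use strict;

my $user = shift @ARGV or die "Usage: $0 <username>\n";

my @secondary;
while (my ($name, $passwd, $gid, $members) = getgrent) {
    # the members field is a space-separated list of login names
    push @secondary, $name if grep { $_ eq $user } split ' ', $members;
}
endgrent;

print "$user belongs to: @secondary\n";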

Perl provides the User::grent module as an object-oriented interface to the getgrent, getgrnam, and getgrgid functions. It works very similarly to User::pwent, but it provides fewer methods as it has fewer fields to manage. It also does not have to contend with the variations of field meanings that User::pwent does, and it is consequently simpler to use. Here is an object-oriented group lister using User::grent:

#!/usr/bin/perl
# listbigr
use warnings;
use strict;

use User::grent;

while (my $group = getgrent) {
    print 'Name    : ', $group->name, "\n";
    print 'Password: ', $group->passwd, "\n";
    print 'Group ID: ', $group->gid, "\n";
    print 'Members : ', join(', ', @{$group->members}), "\n";
}

Like User::pwent (and indeed all similar modules like Net::hostent, etc.), we can import the :FIELDS tag to obtain variables that are automatically updated whenever any of getgrent, getgrnam, or getgrgid is called. Here is the previous example reworked to use variables:

#!/usr/bin/perl
# listfldgr.pl
use warnings;
use strict;

use User::grent qw(:FIELDS);

while (my $group = getgrent) {
    print 'Name    : ', $gr_name, "\n";
    print 'Password: ', $gr_passwd, "\n";
    print 'Group ID: ', $gr_gid, "\n";
    print 'Members : ', join(', ', @{$group->members}), "\n";
}

We can also selectively import variables if we only want to use some of them:

use User::grent qw($gr_name $gr_gid);

In this case, the overriding of getgrent and the like will not take place, so we would need to call User::grent::getgrent rather than just getgrent, or pass getgrent as a term in the import list. To avoid importing anything at all, just pass an empty import list.

The Unary File Test Operators

Perl provides a full complement of file test operators. They test file names for various properties, for example, determining whether they are a file, directory, link, or other kind of file, who owns them, and what their access privileges are. Each file test consists of a single minus sign followed by a letter that determines the nature of the test, and takes either a filehandle or a string containing the file name as its argument. Here are a few examples:

-r $filename   # return true if file is readable by us
-w $filename   # return true if file is writable by us
-d DIRECTORY   # return true if DIRECTORY is opened to a directory
-t STDIN       # return true if STDIN is interactive

Collectively these functions are known as the -X or file test operators.

The slightly odd-looking syntax comes from the Unix file test utility test and the built-in equivalents in most Unix shells. Despite their strange appearance, the file test operators are really functions that behave just like any other built-in unary (single argument) Perl operator, including support for parentheses:

print "It's a file!" if -f($filename);

If no file name or handle is supplied, then the value of $_ is used as a default, which makes for some very terse if somewhat algebraic expressions:

foreach (@files) {
    print "$_ is readable textfile " if -r && -T;   # -T for 'text' file
}

Only single letters following a minus sign are interpreted as file tests, so there is never any confusion between file test operators and negated expressions:

-o($name)   # test if $name is owned by us
-oct($name)   # return negated value of $name interpreted as octal

The full list of file tests follows, loosely categorized into functional groups. Note that not all of these tests may work, depending on the underlying platform. For instance, operating systems that do not understand ownership in the Unix model will not make a distinction between -r and -R, since this requires the concept of real and effective user IDs. (The Win32 API does support "impersonation," but this is not the same thing and is supported by Windows-specific modules instead.) They will also not return anything useful for -o. Similarly, the -b and -c tests are specific to Unix device files and have no relevance on other platforms.

This tests for the existence of a file:

-e Return true if file exists. Equivalent to the return value of the stat function.

These test for read, write, and execute for effective and real users. On non-Unix platforms, which don't have the concepts of real and effective users, the uppercase and lowercase versions are equivalent:

-r Return true if file is readable by effective user ID.
-R Return true if file is readable by real user ID.
-w Return true if file is writable by effective user ID.
-W Return true if file is writable by real user ID.
-x Return true if file is executable by effective user ID.
-X Return true if file is executable by real user ID.

The following test for ownership and permissions (on non-Unix platforms, -o returns 1 and the others return the empty string ''). Note that these are Unix-centric tests. On Windows, files are owned by "groups" as opposed to "users":

-o Return true if file is owned by our real user ID.
-u Return true if file is setuid (chmod u+s, executables only).
-g Return true if file is setgid (chmod g+s, executables only). This does not exist on Windows.
-k Return true if file is sticky (chmod +t, executables only). This does not exist on Windows.

These tests for size work on Windows as on Unix:

-z Return true if file has zero length (that is, it is empty).
-s Return the size of the file if it has non-zero length, false otherwise (the opposite of -z).

The following are file type tests. While -f, -d, and -t are generic, the others are platform dependent:

-f Return true if file is a plain file (that is, not a directory, link, pipe, etc.).
-d Return true if file is a directory.
-l Return true if file is a symbolic link.
-p Return true if file is a named pipe or filehandle is a pipe filehandle.
-S Return true if file is a Unix domain socket or filehandle is a socket filehandle.
-b Return true if file is a block device.
-c Return true if file is a character device.
-t Return true if filehandle is opened to a terminal (that is, it is interactive).

The -T and -B tests determine whether a file is text or binary (for details see "Testing Binary and Text Files" coming up shortly):

-T Return true if file is a text file.
-B Return true if file is not a text file.

The following tests return file times, expressed as ages in days, and also work on Windows:

-M Return the age of the file as a fractional number of days since last modification, counting back from the time at which the application started (which avoids a system call to find the current time). To test which of two files is more recent, we can write

$file = (-M $file1 < -M $file2) ? $file1 : $file2;

-A Return the last access time, expressed as an age in days in the same way as -M.
-C On Unix, return the last inode change time, again expressed as an age in days. (This is not the creation time, as is commonly misconceived, though it coincides with the creation time so long as the inode has not changed since the file was created.) On other platforms, it returns the creation time.
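As a quick illustration of -M in action, this fragment flags files that have not been modified in the last week; the *.log pattern is arbitrary:

foreach my $file (glob '*.log') {
    print "$file has not been modified for over a week\n" if -M $file > 7;
}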

Link Transparency and Testing for Links

This section is only relevant if our chosen platform supports the concept of symbolic links, which is to say all Unix variants but not most other platforms. In particular, Windows "shortcuts" are an artifact of the desktop and unfortunately have nothing to do with the actual filing system.

The stat function, which is the basis of all the file test operators (except -l), automatically follows symbolic links and returns information based on the real file, directory, pipe, etc., that it finds at the end of the link. Consequently, file tests like -f and -d return true if the file at the end of the link is a plain file or directory. Therefore we do not have to worry about links when we just want to know if a file is readable:

my @lines;
if (-e $filename) {
    if (-r $filename) {
        open FILE, $filename;   # open file for reading
        @lines = <FILE>;
    } else {
        die "Cannot open $filename for reading ";
    }
} else {
    die "Cannot open $filename - file does not exist ";
}

If we want to find out if a file is actually a link, we have to use the -l test. This gathers information about the link itself and not the file it points to, returning true if the file is in fact a link. A practical upshot of this is that we can test for broken links by testing -l and -e:

if (-l $file and !-e $file) {
    print "'$file' is a broken link! ";
}

This is also useful for testing that a file is not a link when we do not expect it to be. A utility designed to be run under "root" should check that files it writes to have not been replaced with links to /etc/passwd, for example.

Testing Binary and Text Files

The -T and -B operators test files to see if they are text or binary. They do this by examining the start of the file and counting the number of nontext characters present. If this number exceeds one third, the file is determined to be binary; otherwise, it is determined to be text. If a null (ASCII 0) character is seen anywhere in the examined data, then the file is assumed to be binary.

Since -T and -B only make sense in the context of a plain file, they are commonly combined with -f:

if (-f $file && -T $file) {
   ...
}

-T and -B differ from the other file test operators in that they perform a read of the file in question. When used on a filehandle, both tests read from the current position of the file pointer. An empty file or a filehandle positioned at the end of the file will return true for both -T and -B, since in these cases there is no data to determine which is the correct interpretation.
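For example, this fragment classifies the plain files named on the command line as text or binary, combining -f and -T as just described:

foreach my $file (@ARGV) {
    next unless -f $file;   # skip anything that is not a plain file
    print "$file: ", (-T $file ? 'text' : 'binary'), "\n";
}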

Reusing the Results of a Prior Test

The underlying mechanism behind the file test operators is a call to either stat or, in the case of -l, lstat. In order to test the file, each operator will make a call to stat to interrogate the file for information. If we want to make several tests, this is inefficient, because a disc access needs to be made in each case.

However, if we have already called stat or lstat for the file we want to test, then we can avoid these extra calls by using the special filehandle _, which will substitute the results of the last call to stat (or lstat) in place of accessing the file. Here is a short example that tests a file name in six different ways based on one call to lstat:

#!/usr/bin/perl
# statonce.pl
use warnings;
use strict;

print "Enter filename to test: ";
my $filename = <>;
chomp $filename;

if (lstat $filename) {
    print "$filename is a file " if -f _;
    print "$filename is a directory " if -d _;
    print "$filename is a link " if -l _;

    print "$filename is readable " if -r _;
    print "$filename is writable " if -w _;
    print "$filename is executable " if -x _;
} else {
    print "$filename does not exist ";
}

Note that in this example we have used lstat, so the link test -l _ will work correctly. -l requires an lstat and not a stat, and it will generate an error if we try to use it with the results of a previous stat:


The stat preceding -l _ wasn't an lstat...

Caching of the results of stat and lstat works for prior file tests too, so we could also write something like this:

if (-e $filename) {
   print "$filename exists ";
   print "$filename is a file " if -f _;
}

Or:

if (-f $filename && -T _) {
   print "$filename exists and is a text file ";
}

The only drawback to this is that only -l calls lstat, so we cannot test for a link this way unless the first test is -l.

Access Control Lists, the Superuser, and the filetest Pragma

The file tests -r, -w, and -x and their uppercase counterparts determine their return value from the results of the stat function. Unfortunately, this does not always produce an accurate result. Some of the reasons that these file tests may produce incorrect or misleading results include

  • An ACL is in operation.
  • The file system is read-only.
  • We have superuser privileges.

All these cases tend to produce "false positive" results, implying that the file is accessible when in fact it is not. For example, the file may be writable, but the file system is not.

In the case of the superuser, -r, -R, -w, and -W will always return true, even if the file is set as unreadable and unwritable, because the superuser can just disregard the actual file permissions. Similarly, -x and -X will return true if any of the execute permissions (user, group, other) are set. To check if the file is really writable, we must use stat and check the file permissions directly:

$mode = ((stat $filename)[2]);
$writable = $mode & 0200;   # test for owner write permission

Tip Again, this is a Unix-specific example. Other platforms do not support permissions or support them in a different way.


For the other cases, we can try to use the filetest pragma, which alters the operation of the file tests for access by overriding them with more rigorous tests that interrogate the operating system instead. Currently there is only one mode of operation, access, which causes the file test operators to use the underlying access system call, if available:

use filetest 'access';

This modifies the behavior of the file test operators to use the operating system's access call to check the true permission of a file, as modified by access control lists, or file systems that are mounted read-only. It also provides an access subroutine, which allows us to make our own direct tests of file names (note that it does not work on filehandles). It takes a file name and a numeric flag containing the permissions we want to check for. These are defined as constants in the POSIX module and are listed in Table 13-2.

Table 13-2. POSIX File Permission Constants

Constant Description
R_OK Test file has read permission.
W_OK Test file has write permission.
X_OK Test file has execute permission.
F_OK Test that file exists. Implied by R_OK, W_OK, or X_OK.

Note that F_OK is implied by the other three, so it need never be specified directly (to test for existence, we can as easily use the -e test, or -f if we require a plain file).

While access provides no extra functionality over the standard file tests, it does allow us to make more than one test simultaneously. As an example, to test that a file is both readable and writable, we would use

use filetest 'access';
use POSIX;
...
$can_readwrite = access($filename, R_OK|W_OK);

The return value from access is undef on failure and "0 but true" on success; the latter is a string that evaluates to zero in a numeric context but true in a Boolean context such as an if or while condition. On failure, $! is set to indicate the reason.

Automating Multiple File Tests

We often want to perform a series of different file tests across a range of different files. Installation scripts, for example, often do this to verify that all the installed files are in the correct place and with the correct permissions.

While it is possible to manually work through a list of files, we can make life a little simpler by using the File::CheckTree module instead. This module provides a single subroutine, validate, that takes a series of file names and -X style file tests and applies each of them in turn, generating warnings as it does so.

Unusually for a library subroutine, validate accepts its input in lines, in order to allow the list of files and tests to be written in the style of a manifest list. In the following example, validate is being used to check for the existence of three directories and an executable file installed by a fictional application:

$warnings = validate(q{
    /home/install/myapp/scripts -d
    /home/install/myapp/docs -d
    /home/install/myapp/bin -d
    /home/install/myapp/bin/myapp -fx
});

validate returns the number of warnings generated during the test, so we can use it as part of a larger installation script. If we want to disable or redirect the warnings, we can do so by defining a signal handler:

$SIG{__WARN__} = sub { };   # do nothing
$SIG{__WARN__} = sub { print LOGFILE @_ };   # redirect to install log

The same file may be listed any number of times, with different tests applied each time. Alternatively, multiple tests may be bunched together into one file test, so that instead of specifying two tests one after the other, they can be done together. Hence, instead of writing two lines:

/home/install/myapp/bin/myapp -f
/home/install/myapp/bin/myapp -x

we can write both tests as one line:

/home/install/myapp/bin/myapp -fx

The second test is dependent on the first, so only one warning can be generated from a bunched test. If we want to test for both conditions independently (we want to know if it is not a plain file, and we also want to know if it is not executable), we need to put the tests on separate lines.

Tests may also be negated by prefixing them with a !, in which case all the individual tests must fail for the line to succeed. For example, to test whether a file is neither setuid nor setgid:

validate(q{
   /home/install/myapp/scripts/myscript.pl   !-ug
})

Normal and negated tests cannot be bunched, so if we want to test that a file name corresponds to a plain file that is not executable, we must use separate tests:

validate(q{
   /home/install/myapp/scripts/myscript.pl   -f
   /home/install/myapp/scripts/myscript.pl   !-xug
})

Rather than a file test operator, we can also supply the command cd. This causes the directory named at the start of the line to be made the current working directory. Any relative paths given after this are taken relative to that directory until the next cd, which may also be relative:

validate(q{
   /home/install/myapp     cd || die
      scripts         -rd
      cgi              cd
      guestbook.cgi   -xg
      guestbook.cgi   !-u
      ..               cd
      about_us.html   -rf
      text.bin        -f  || warn "Not a plain file"
});

Tip validate is insensitive to extra whitespace, so we can use additional spacing to clarify what file is being tested where. In the preceding example, we have indented the files to make it clear which directory they are being tested in.


We can supply our own warnings and make tests fatal by suffixing the file test with || and either warn or die. These work in exactly the same way as their Perl function counterparts. If our own error messages are specified, we can use the variable $file, supplied by the module, to insert the name of the file whose test failed:

validate(q{
   /etc         -d   || warn "What, no $file directory?\n"
   /var/spool   -d   || die
})

This trick relies on the error messages being interpolated at run time, so using single quotes or the q quoting operator is essential in this case.

One of the advantages of File::CheckTree is that the file list can be built dynamically, possibly generated from an existing file tree created by File::Find (detailed in "Finding Files" later in this chapter). For example, using File::Find, we can determine the type and permissions of each file and directory in a tree, then generate a test list suitable for File::CheckTree to validate new installations of that tree. See the section "Finding Files" and the other modules in this section for pointers.
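As a rough sketch of the idea, and assuming an installed tree under /home/install/myapp, we might generate a manifest of simple type tests like this:

use File::Find;
use File::CheckTree;

my @manifest;
find(sub {
    return unless -e;   # stat each entry once; the result is cached in _
    push @manifest, $File::Find::name . (-d _ ? ' -d' : ' -f');
}, '/home/install/myapp');

# validate a new installation against the generated manifest
my $warnings = validate(join "\n", @manifest);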

Interrogating Files with stat and lstat

If we want to know more than one attribute of a file, we can skip multiple file test operators and instead make use of the stat or lstat functions directly. Both functions return details of the file name or filehandle supplied as their argument. lstat is identical to stat except in the case of a symbolic link, where stat will return details of the file pointed to by the link and lstat will return details of the link itself. In either case, a 13-element list is returned:

# stat filehandle into a list
@stat_info = stat FILEHANDLE;

# lstat file name into separate scalars
($dev, $ino, $mode, $nlink, $uid, $gid, $rdev, $size,
    $atime, $mtime, $ctime, $blksize, $blocks) = lstat $filename;

The stat function will also work on a filehandle, though the information returned is greatly influenced by the type of filehandle under interrogation:

my @stdin_info=stat STDIN; # stat standard input
opendir CWD, ".";
my @cwd_info=stat CWD;     # stat a dir handle

Note The lstat function will not work on a filehandle and will generate a warning if we try. lstat only makes sense for actual files since it is concerned with symbolic links, a filing system concept that does not translate to filehandles.


The thirteen values are always returned, but they may not be defined or have meaning in every case, either because they do not apply to the file or filehandle being tested or because they have no meaning on the underlying platform. Thirteen values is a lot, so the File::stat module provides an object-oriented interface that lets us refer to these values by name instead. The full list of values, including the meanings and index number, is provided in Table 13-3. The name in the first column is the conventional variable name used previously and also the name of the method provided by the File::stat module.

Table 13-3. stat Fields

Method Number Description
dev 0 The device number of the file system on which the file resides.
ino 1 The inode of the file.
mode 2 The file mode, combining the file type and the file permissions.
nlink 3 The number of hard (not symbolic) references to the inode underneath the file name.
uid 4 The user ID of the user that owns the file.
gid 5 The group ID of the group that owns the file.
rdev 6 The device identifier (block and character special files only).
size 7 The size of the file, in bytes.
atime 8 The last access time, in seconds.
mtime 9 The last modification time, in seconds.
ctime 10 The last inode change time, in seconds.
blksize 11 The preferred block size of the file system.
blocks 12 The number of blocks allocated to the file. The product of $stat_info[11]*$stat_info[12] is the size of the file as allocated in the file system. However, the actual size of the file in terms of its contents will most likely be less than this as it will only partially fill the last block—use $stat_info[7] (size) to find that.

Several of the values returned by stat relate to the "inode" of the file. Under Unix, the inode of a file is a numeric ID, allocated by the file system, which is its "true" identity, with the file name being just an alias. On platforms that support it, more than one file name may point to the same file; the number of hard links is returned in the nlink value and may be greater than one, but not less (since that would mean the inode had no file names). The ctime value indicates the last time the inode of the file changed, which often coincides with the creation time. Conversely, the access and modification times refer to actual file access.
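For example, on a Unix platform we can read the inode and hard link count straight out of the stat list; this fragment assumes $filename holds the name of an existing file:

my ($inode, $nlinks) = (stat $filename)[1, 3];
print "$filename: inode $inode, $nlinks hard link(s)\n";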

On other platforms, some of these values are either undefined or meaningless. Under Windows, the device number is related to the drive letter, there is no inode, and the value of nlink is always 1. The uid and gid values are always zero, and no value is returned for either blksize or blocks. There is a mode, though only the file type is useful; the permissions are always 777. While Windows NT/2000/XP supports a fairly complex permissions system, it is not accessible this way; the Win32::FileSecurity and Win32::FilePermissions modules must be used instead.

Accessing the values returned by stat can be a little inconvenient, not to mention inelegant. For example, this is how we find the size of a file:

$size = (stat $filename)[7];

Or, printing it out:

print ((stat $filename)[7]);   # need to use extra parentheses with print

Unless we happen to know that the eighth element is the size or we are taking care to write particularly legible code, this leads to unfriendly code. Fortunately, we can use the File::stat module instead.

Using stat Objects

The File::stat module simplifies the use of stat and lstat by overriding them with subroutines that return stat objects instead of a list. These objects can then be queried using one of File::stat's methods, which have the same names as the values that they return.

As an example, this short program uses the size, blksize, and blocks methods to return the size of the file supplied on the command line:

#!/usr/bin/perl
# filesize.pl
use warnings;
use strict;

use File::stat;

print "Enter filename: ";
my $filename = <>;
chomp $filename;
if (my $stat = stat $filename) {
    print "'$filename' is ", $stat->size,
          " bytes and occupies ", $stat->blksize * $stat->blocks,
          " bytes of disc space ";
} else {
    print "Cannot stat $filename: $| ";
}

As an alternative to using object methods, we can import 13 scalar variables containing the results of the last stat or lstat into our program by adding an import list of :FIELDS. Each variable takes the same name as the corresponding method prefixed with the string st_. For example:

#!/usr/bin/perl
# filesizefld.pl
use warnings;
use strict;

use File::stat qw(:FIELDS);

print "Enter filename: ";
my $filename = <>;
chomp($filename);
if (stat $filename) {
    print "'$filename' is ", $st_size,
          " bytes and occupies ", $st_blksize * $st_blocks,
          " bytes of disc space ";
} else {
    print "Cannot stat $filename: $| ";
}

The original versions of stat and lstat can be used by prefixing them with the CORE:: package name:

use File::stat;

...

my @new_stat = stat $filename;   # use new 'stat'
my @old_stat = CORE::stat $filename;   # use original 'stat'

Alternatively, we can prevent the override from happening by supplying an empty import list:

use File::stat qw();   # or '', etc.

We can now use the File::stat stat and lstat methods by qualifying them with the full package name:

my $stat = File::stat::stat $filename;
print "File is ", $stat->size(), " bytes ";

Changing File Attributes

There are three basic kinds of file attribute that we can read and attempt to modify: ownership, access permissions, and the access and modification timestamps. Unix and other platforms that support the concept of file permissions and ownership can make use of the chmod and chown functions to modify them from Perl. chmod modifies the file permissions of a file for the three categories: user, group, and other. The chown function modifies which user corresponds to the user permissions and which group corresponds to the group permissions. Every other user and group falls under the other category. Ownership and permissions are therefore inextricably linked and are combined into the mode value returned by stat.

File Ownership

File ownership is a highly platform-dependent concept. Perl grew up on Unix systems, and so it attempts to handle ownership in a Unix-like way. Under Unix and other platforms that borrowed their semantics from Unix, files have an owner, represented by the file's user ID, and a group owner, represented by the file's group ID. Each relates to a different set of file permissions, so the user may have the ability to read and write a file, whereas other users in the same group may only get to read it. Others may not have even that, depending on the setting of the file permissions.

File ownership is handled by the chown function, which maps to the underlying chown system call and can change both the owning user and the owning group. It takes at least three parameters, a user ID, a group ID, and one or more files to change:

my $successes = chown $uid, $gid, @files;

The number of files successfully changed is returned. If only one file is given to chown, this allows a simple Boolean test to be used to determine success:

unless (chown $uid, $gid, $filename) {
   die "chown failed: $! ";
}

To change only the user or group, supply -1 as the value for the other parameter. For instance, a chgrp function can be simulated with

sub chgrp {
   return chown(shift, -1, @_);
}

Note that on most systems (that is, most systems that comprehend file ownership in the first place), only the superuser can change the user who owns the file, though the group can be changed to another group that the same user belongs to. It is possible to determine if a change of ownership is permitted by calling the sysconf function:

use POSIX;   # provides sysconf and the _PC_CHOWN_RESTRICTED constant

my $chown_restricted = sysconf(_PC_CHOWN_RESTRICTED);

If this returns a true value, then a chown will not be permitted.

chown needs a user or group ID to function; it will not accept a user or group name. To deduce a user ID from the name, at least on a Unix-like system, we can use the getpwnam function. Likewise, to deduce a group ID from the name, we can use the getgrnam function. We can use getpwent and getgrent instead to retrieve one user or group respectively, as we saw in the section "Getting User and Group Information" earlier in the chapter. As a quick example, the following script builds tables of user and group IDs, which can be subsequently used in chown:

#!/usr/bin/perl
use warnings;
use strict;

# get user names and primary groups
my (%users, %usergroup);
while (my ($name, $passwd, $uid, $gid) = getpwent) {
    $users{$name} = $uid;
    $usergroup{$name} = $gid;
}


# get group names and gids
my (%groups, @groups);
while (my ($name, $passwd, $gid) = getgrent) {
    $groups{$name} = $gid;
    $groups[$gid] = $name;
}

# print out basic user and group information
foreach my $user (sort {$users{$a} <=> $users{$b}} keys %users) {
    print "$users{$user}: $user, group ", $usergroup{$user},
        " (", $groups[$usergroup{$user}], ") ";
}

File Permissions

Perl provides two functions that are specifically related to file permissions, chmod and umask. As noted earlier, these will work for any Unix-like platform, including MacOS X, but not Windows, where the Win32::FileSecurity and Win32::FilePermissions modules must be used. The chmod function allows us to set the permissions of a file. Permissions are grouped into three categories: user, which applies to the file's owner, group, which applies to the file's group owner, and other, which applies to anyone who is not the file's owner or a member of the file's group owner. Within each category each file may be given read, write, and execute permission.

chmod represents each of the nine values (3 categories × 3 permissions) by a different numeric flag; these are traditionally put together to form a three-digit octal number, each digit corresponding to the respective category. The flag values within each digit are 4 for read permission, 2 for write permission, and 1 for execute permission, as demonstrated by the following examples (prefixed by a leading 0 to remind us that these are octal values):

0200 Owner write permission
0040 Group read permission
0001 Other execute permission

The total of the read, write, and execute permissions for a category is 7, which is why octal is so convenient to represent the combined permissions flag. Read, write, and execute permission for the owner only would be represented as 0700. Similarly, read, write, and execute permission for the owner, read and execute permission for the group, and execute-only permission for everyone else would be 0751, which is 0400 + 0200 + 0100 + 0040 + 0010 + 0001.
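We can see the arithmetic at work in a short fragment that builds 0751 from its component flags:

# building 0751 from its components
my $mode = 0400 | 0200 | 0100   # user read, write, execute
         | 0040 | 0010          # group read, execute
         | 0001;                # other execute
printf "mode is %o\n", $mode;   # prints 'mode is 751'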

Having explained the permissions flag, the chmod function itself is comparatively simple, taking a permissions flag, as calculated previously, as its first argument and applying it to one or more files given as the second and subsequent arguments. For example:

chmod 0751, @files;

As with chown, the number of successfully chmodded files is returned, or zero if no files were changed successfully. If only one file is supplied, the return value of chmod can be tested as a Boolean result in an if or unless statement:

unless (chmod 0751, $file) {
    die "Unable to chmod: $! ";
}

The umask function allows us to change the default permissions mask used whenever Perl creates a new file. The bits in the umask have the opposite meaning to the permissions passed to chmod: any bit set in the umask is removed from the permissions requested by open or sysopen when the file is created. Thus the permission bits of the umask mask the permissions that open and sysopen try to set. Table 13-4 shows the permission bits that can be used with umask and their meanings.

Table 13-4. umask File Permissions

umask Digit File Permission
0 Read and write
1 Read and write
2 Read only
3 Read only
4 Write only
5 Write only
6 No read and no write
7 No read and no write

umask only defines the access permissions. Called without an argument, it returns the current value of the umask, which is inherited from the shell and is typically set to a value of 002 (mask other write permission) or 022 (mask group and other write permissions):

$umask = umask;

Alternatively, umask may be called with a single numeric parameter, traditionally expressed in octal or alternatively as a combination of mode flags as described previously. For example:

umask 022;

Overriding the umask explicitly is not usually a good idea, since the user might have it set to a more restrictive value. A better idea is to combine the permissions we want to restrict with the existing umask, using a bitwise OR. For example:

umask (022 | umask);

The open function always uses permissions of 0666 (read and write for all categories), whereas sysopen allows the permissions to be specified in the call. Since umask controls the permissions of new files by removing unwanted permissions, we do not need to (and generally should not) specify more restrictive permissions to sysopen.
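To illustrate the interaction, here is a minimal sketch; the file name is arbitrary, and with a umask of 022 the requested permissions of 0666 come out as 0644:

use Fcntl qw(O_WRONLY O_CREAT);

umask 022;   # mask group and other write permission
sysopen NEWFILE, 'newfile.dat', O_WRONLY | O_CREAT, 0666
    or die "Unable to create: $!\n";
# the file is created with permissions 0666 & ~022, that is, 0644
close NEWFILE;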

File Access Times

The built-in utime function provides the ability to change the last access and last modification time of one or more files. It takes at least three arguments: the new access time, in seconds since 1970/1/1 00:00:00, the new modification time, also in seconds, and then the file or files whose times are to be changed. For example:

my $onedayago = time - 24*60*60;
utime $onedayago, time(), "myfile.txt", "my2ndfile.txt";

This will set the specified files to have a last access time of exactly a day ago and a last modification time of right now. From Perl 5.8, we can also specify undef to mean "right now," so to emulate the Unix touch command on all C or C++ files in the current directory, we could use

utime undef, undef, <*.c>, <*.cpp>;

The Fcntl Module

The Fcntl module provides symbolic constants for all of the flags contained in both the permissions and the file type parts of the mode value. It also provides two functions for extracting each part, as an alternative to computing the values by hand:

use Fcntl qw(:mode);         # import file mode constants and functions

my $type = S_IFMT($mode);    # extract file type
my $perm = S_IMODE($mode);   # extract file permissions

printf "File permissions are: %o\n", $perm;

The file type part of the mode defines the type of the file and is the basis of the file test operators like -d, -f, and -l that test for the type of a file. The Fcntl module defines symbolic constants for these, and they are summarized in Table 13-5.

Table 13-5. Fcntl Module File Test Symbols

Name Description Operator
S_IFREG Regular file -f
S_IFDIR Directory -d
S_IFLNK Link -l
S_IFBLK Block special file -b
S_IFCHR Character special file -c
S_IFIFO Pipe or named fifo -p
S_IFSOCK Socket -S
S_IFWHT Whiteout (BSD-specific) (none)

Note that Fcntl also defines a number of subroutines that test the mode for the desired property. These have very similar names to the flags, for example, S_ISDIR and S_ISFIFO, and it is easy to get the subroutines and flags confused. Since we have the file test operators, we do not usually need to use these subroutines, so we mention them only to eliminate possible confusion.

These flags can also be used with sysopen, IO::File's new method, and the stat function described previously, where they can be compared against the mode value. As an example of how these flags can be used, here is the equivalent of the -d file test operator written using stat and the Fcntl module:

my $mode = ((stat $filename)[2]);
my $is_directory = $mode & S_IFDIR;

Or, to test that a file is neither a block nor a character device:

my $is_not_special = (S_IFMT($mode) != S_IFBLK) && (S_IFMT($mode) != S_IFCHR);

The Fcntl module also defines functions that do this for us. Each function takes the same name as the flag but with S_IF replaced with S_IS. For instance, to test for a directory, we can instead use

my $is_directory = S_ISDIR($mode);

Of course, the -d file test operator is somewhat simpler in this case.

The permissions part of the mode defines the read, write, and execute privileges that the file grants to the file's owner, the file's group, and others. It is the basis of the file test operators like -r, -w, -u, and -g that test for the accessibility of a file. The Fcntl module also defines symbolic constants for these, summarized in Table 13-6.

Table 13-6. Fcntl Module File Permission Symbols

Name Description Number
S_IRUSR User can read. 00400
S_IWUSR User can write. 00200
S_IXUSR User can execute. 00100
S_IRGRP Group can read. 00040
S_IWGRP Group can write. 00020
S_IXGRP Group can execute. 00010
S_IROTH Others can read. 00004
S_IWOTH Others can write. 00002
S_IXOTH Others can execute. 00001
S_IRWXU User can read, write, execute. 00700
S_IRWXG Group can read, write, execute. 00070
S_IRWXO Others can read, write, execute. 00007
S_ISUID Setuid. 04000
S_ISGID Setgid. 02000
S_ISVTX Sticky (S) bit. 01000
S_ISTXT Swap (t) bit. 10000

For example, to test that a file grants user read and write permission plus group read permission, we could use

$perms_ok = ($mode & (S_IRUSR | S_IWUSR | S_IRGRP)) == (S_IRUSR | S_IWUSR | S_IRGRP);

To test that a file has exactly these permissions and no others, we would instead write

$exact_perms = S_IMODE($mode) == (S_IRUSR | S_IWUSR | S_IRGRP);

The file permission flags are useful not only for making sense of the mode value returned by stat, but also as inputs for the chmod function. Consult the manual page for the chmod system call (on Unix platforms) for details of the more esoteric bits such as sticky and swap.
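As a brief sketch of using the permission symbols as input to chmod, the following sets a conventional rw-r--r-- mode (equivalent to 0644) on a file whose name is assumed to be in $filename:

use Fcntl qw(:mode);

chmod S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH, $filename
    or die "Unable to chmod: $!\n";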

Linking, Unlinking, Deleting, and Renaming Files

File names can be created and removed directly with the link and unlink built-in functions. These provide the ability to edit the entries for files in the file system, creating new ones or removing existing ones. They are not the same as creating and deleting files, however. On platforms that support the concept, link creates a new link (an entry in the filing system) to an existing file; it does not create a copy (except on Windows, where it does exactly this). Likewise, unlink removes a file name from the filing system, but if the file has more than one link, and therefore more than one file name, the file will persist. This is an important point to grasp, because it often leads to confusion.

Linking Files

The link function creates a new link (sometimes called a hard link, to differentiate it from a soft or symbolic link) for the named file. It only works on platforms that support multiple hard links for the same file:

if (link $currentname, $newname) {
    print "Linked $currentname to $newname ok ";
} else {
    warn "Failed to link: $! ";
}

link will not create links for directories, though it will create links for all other types of files. For directories, we can create symbolic links only. Additionally, we cannot create hard links between different file systems, and some file systems (for example, AFS) do not support them at all. On Unix, link works by giving two names in the file system the same underlying inode. On Windows and other file systems that do not support this concept, an attempt to link will create a copy of the original file.

On success, link returns true, and a new file name will exist for the file. The old one continues to exist and can either be used to read or alter the contents of the file. Both links are therefore exactly equivalent. Immediately after creation, the new link will carry the same permissions and ownership as the original, but this can subsequently be changed with the chmod and chown built-in functions to, for example, create a read-only and a read-write entry point to the same data.

Deleting and Unlinking Files

The opposite of linking is unlinking. Files can be unlinked with the built-in unlink function, which takes one or more file names as a parameter. If no file name is supplied, unlink uses $_:

unlink $currentname;   # single file

foreach (<*.*>) {
    unlink if /\.bak$/;   # unlink $_ if it ends '.bak'
}

unlink <*.bak>;   # the same, via a file glob

On platforms where unlinking does not apply (because multiple hard links are not permissible), unlink simply deletes the file. Otherwise, unlink is not necessarily the same as deleting a file, for two reasons. First, if the file has more than one link, then it will still be available by other names in the file system. Although we cannot (easily) find out the names of the other links, we can find out how many links a file has through stat. We can establish in advance if unlink will really delete the file or just remove one of its links by calling stat:

my $links = (stat $filename)[3];

Or more legibly with the File::stat module:

use File::stat;
my $stat = stat($filename);
my $links = $stat->nlink;

Second, on platforms that support it (generally Unix-like ones), if any process has an open filehandle for the file, then it will persist for as long as the filehandle persists. This means that even after an unlink has completely removed all links to a file, it will still exist and can be read, written, and have its contents copied to a new file. Indeed, the new_tmpfile method of IO::File uses exactly this trick, where the platform allows it, to provide truly anonymous temporary files—"Temporary Files" covers this in detail later in this chapter. On other platforms (for example, Windows), Perl will generally reject the attempt to unlink the file so long as a process holds an open filehandle on it. Do not rely on the underlying platform allowing a file to be deleted while it is still open; close it first to be sure.

The unlink function will not unlink directories unless three criteria are met: we are on Unix, Perl was given the -U flag, and we have superuser privilege. Even so, it is an inadvisable thing to do, since it removes the directory and its contents, including any subdirectories, from the filing system hierarchy without recycling the space they occupy on the disc. Instead they will appear in the lost+found directory the next time an fsck filing system check is performed, which is unlikely to be what we intended. The rmdir built-in function covered later in the chapter is the preferred approach, or see the rmtree function from File::Path for more advanced applications involving multiple directories.
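
As a brief sketch of the latter (the path here is hypothetical), rmtree deletes an entire directory tree in one call and returns the number of items it removed:

use File::Path qw(rmtree);

my $removed = rmtree('/tmp/scratch');
print "removed $removed files and directories\n";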

Renaming Files

Given the preceding, renaming a file is just a case of linking it to a new name, then unlinking it from the old, at least under Unix. The following subroutine demonstrates a generic way of doing this:

sub rename {
    my ($current, $new) = @_;
    unlink $current if link($current, $new);
}

The built-in rename function is essentially equivalent to the preceding subroutine:

# using the built-in function:
rename($current, $new);

This is effective for simple cases, but it will fail in a number of situations, most notably if the new file name is on a different file system from the old (a floppy disk to a hard drive, for instance). rename uses the rename system call, if available. It may also fail on (non-Unix) platforms that do not allow an open file to be renamed.

For a properly portable solution that works across all platforms, consider using the move routine from the File::Copy module. For the simpler cases it will just use rename, but it will also handle special cases and platform limitations.
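
As a minimal sketch (the paths are hypothetical), move behaves like rename but copes with a destination on a different file system:

use File::Copy qw(move);

move('/var/log/app.log', '/archive/app.log')
    or die "move failed: $!\n";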

Symbolic Links

On platforms that support it, we can also create a soft or symbolic link with the built-in symlink function. This is syntactically identical to link but creates a pointer to the file rather than a direct hard link:

unless (symlink $currentname, $newname) {
    warn "Failed to link: $!\n";
}

The return value from symlink is 1 on success or 0 on failure. On platforms that do not support symbolic links (a shortcut is an invention of the Windows desktop, not the file system), symlink produces a fatal error. If we are writing code to be portable, then we can protect against this by using eval:

my $linked = eval {symlink($currentname, $newname);};
if (not defined $linked) {
    warn "Symlink not supported on this platform\n";
} elsif (not $linked) {
    warn "Link failed: $!\n";
}

To test whether symlink is available without actually creating a symbolic link, supply an empty file name for both arguments:

my $symlinking = eval {symlink('',''), 1};

If symlink is not implemented, the eval dies when it tries to execute it and so returns undef. If it is implemented, the symlink (which does nothing useful, given two empty file names) is followed by the 1, which eval returns. This is a generically useful trick for all kinds of situations, of course.
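
The same probe can test any built-in that a platform may leave unimplemented; for instance, hypothetically probing for hard link support as well:

my $can_symlink  = eval { symlink('', ''); 1 };
my $can_hardlink = eval { link('', ''); 1 };
warn "no symbolic links on this platform\n" unless $can_symlink;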

Symbolic links are the links that the -l and lstat functions check for; hard links are indistinguishable from ordinary file names because they are ordinary file names. Most operations performed on symbolic links (with the notable exceptions of -l and lstat of course) are transferred to the linked file, if it exists. In particular, symbolic links have the generic file permissions 777, meaning everyone is permitted to do everything. However, this only means that the permissions of the file that the link points towards take priority. An attempt to open the link for writing will be translated into an attempt to open the linked file and check its permissions rather than those of the symbolic link. Even chmod will affect the permissions of the real file, not the link.

Symbolic links may legally point to other symbolic links, in which case the end of the link is the file that the last symbolic link points to. If the file has subsequently been moved or deleted, the symbolic link is said to be "broken." We can check for broken links with

if (-l $linkname and !-e $linkname) {
    print "$linkname is a broken link!\n";
}

See "Interrogating Files with stat and lstat" earlier in the chapter for more on this (and in particular why the special file name _ cannot be used after -e in this particular case) and some variations on the same theme.

Copying and Moving Files

One way to copy a file to a new name is to open a filehandle for both the old and the new names and copy data between them, as this rather simplistic utility attempts to do:

#!/usr/bin/perl
# dumbcopy
use warnings;
use strict;

print "Filename: ";
my $infile = <>;
chomp $infile;
print "New name: ";
my $outfile = <>;
chomp $outfile;
open IN, $infile;
open OUT, "> $outfile";
print OUT <IN>;
close IN;
close OUT;

The problem with this approach is that it does not take into account the existing file permissions and ownerships. If we run this on a Unix platform and the file we are copying happens to be executable, the copy will lose the executable permissions. If we run this on a system that cares about the difference between binary and text files, the file can become corrupted unless we also add a call to binmode. Fortunately, the File::Copy module handles these issues for us.

The File::Copy module provides subroutines for moving and copying files without having to directly manipulate them via filehandles. It also correctly preserves the file permissions. To make use of it, we just need to use it:

use File::Copy;

File::Copy contains two primary subroutines, copy and move. copy takes the names of two files or filehandles as its arguments and copies the contents of the first to the second, creating it if necessary. If the first argument is a filehandle, it is read from; if the second is a filehandle, it is written to. For example:

copy "myfile", "myfile2";  # copy one file to another
copy "myfile", *STDOUT;   # copy file to standard output
copy LOG, "logfile";       # copy input to filehandle

If neither argument is a filehandle, copy does a system copy in order to preserve file attributes and permissions. This copy is directly available as the syscopy subroutine and is portable across platforms, as we will see in a moment.

copy also takes a third, optional argument, which if specified determines the buffer size to use. For instance, to copy the file in chunks of 16K, we might use

copy "myfile", "myfile2", 16 * 1024;

Without a buffer size, copy will default to the size of the file, or 2MB, whichever is smaller. Setting a smaller buffer will cause the copy to take longer, but it will use less memory while doing it.

move takes the names of two files (not filehandles) as its arguments and attempts to move the file named by the first argument to have the name given as the second. For example:

move "myfile", "myfile2";   # move file to another name

If possible, move will rename the file using the link and unlink functions. If not, it will copy the file using copy and then delete the original. Note, however, that in this case we cannot set a buffer size as an optional third parameter.

If an error occurs with either copy or move (if the file system runs out of space, for instance), the destination file may be left incomplete. In the case of a move that had to copy the file, this will lose information. Where this matters, it is safer to copy the file ourselves and unlink the original only once the copy has been verified.

On platforms that care about binary and text files (for example, Windows), to make a copy explicitly binary, use binmode or make use of the open pragmatic module described earlier in the chapter.

Here is a rewritten version of the file copy utility we started with. Note that it is not only better but also considerably smaller:

#!/usr/bin/perl
# smartcopy.pl
use warnings;
use strict;

use File::Copy;

print "Filename: ";
my $infile = <>;
chomp $infile;
print "New name: ";
my $outfile = <>;
chomp $outfile;

unless (copy $infile, $outfile) {
   print "Failed to copy '$infile' to '$outfile': $! ";
}

As a special case, if the first argument to copy or move is a file name and the second is a directory, then the destination file is placed inside the directory with the same name as the source file.
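
For example, assuming a directory /backup exists, both of the following place their result inside it under the source file's name:

copy 'myfile', '/backup';    # creates /backup/myfile
move 'myfile2', '/backup';   # becomes /backup/myfile2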

Unix aficionados will be happy to know that the aliases cp and mv are available for copy and move and can be imported by specifying one or both of them in the import list:

use File::Copy qw(cp mv);

System Level Copies and Platform Portability

As well as the standard copy, which works with either file names or filehandles, File::Copy defines the syscopy subroutine, which provides direct access to the copy function of the underlying operating system. The copy subroutine calls syscopy if both arguments are file names and the second is not a directory (as seen in the previous section); otherwise, it opens whichever argument is not a filehandle and performs a read-write copy through the filehandles.

syscopy calls the underlying copy function supplied by the operating system and is thus portable across different platforms. Under Unix, it calls the copy subroutine, as there is no system copy call. Under Windows, it calls Win32::CopyFile. Under OS/2 and VMS, it calls syscopy and rmscopy, respectively. This makes the File::Copy module an effective way to copy files without worrying about platform dependencies.
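
If we specifically want the system-level copy, we can call syscopy by its fully qualified name. A minimal sketch, with hypothetical file names:

use File::Copy;

File::Copy::syscopy('settings.cfg', 'settings.bak')
    or warn "syscopy failed: $!\n";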

Comparing Files

The File::Compare module is a member of the Perl standard library that provides portable file comparison features for our applications. It provides two main subroutines, compare and compare_text, both of which are available when using the module:

use File::Compare;

The compare subroutine simply compares two files or filehandles byte for byte, returning 0 if they are equal, 1 if they are not, and -1 if an error was encountered:

SWITCH: foreach (compare $file1, $file2) {
    /^0/ and print("Files are equal"), last;
    /^1/ and print("Files are not equal"), last;
    print "Error comparing files: $! ";
}

compare also accepts a third optional argument, which if specified defines the size of the buffer used to read from the two files or filehandles. This works in an identical manner to the buffer size of File::Copy's copy subroutine, defaulting to the size of the file or 2MB, whichever is smaller, if no buffer size is specified. Note that compare automatically puts both files into a binary mode for comparison.
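
For instance, to trade a little speed for a smaller memory footprint, we might compare in 16KB chunks (assuming $file1 and $file2 hold the file names, as earlier):

use File::Compare;

if (compare($file1, $file2, 16 * 1024) == 0) {
    print "Files are equal\n";
}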

The compare_text function operates identically to compare but takes as its third argument an optional code reference to an anonymous comparison subroutine. Unlike compare, compare_text compares files in text mode (assuming that the operating system draws a distinction), so without the third parameter, compare_text simply compares the two files in text mode.

The comparison subroutine, if supplied, should return a Boolean result: 0 if the lines are considered equal and 1 otherwise. The default used when no explicit comparison is provided is equivalent to

sub {$_[0] ne $_[1]}

We can supply our own comparison subroutines to produce different results. For example, this comparison checks files for case-insensitive equivalence:

my $result = compare_text ($file1, $file2, sub {lc($_[0]) ne lc($_[1])});

Similarly, this comparison uses a named subroutine that strips extra whitespace from the start and end of lines before comparing them:

sub stripcmp {
    ($a, $b) = @_;
    $a =~ s/^\s*(.*?)\s*$/$1/;
    $b =~ s/^\s*(.*?)\s*$/$1/;
    return $a ne $b;
}
my $result = compare_text ($file1, $file2, \&stripcmp);

For those who prefer more Unix-like nomenclature, cmp may be used as an alias for compare by importing it specifically:

use File::Compare qw(cmp);

Finding Files

The File::Find module provides a multipurpose file-finding subroutine that we can configure to operate in a number of different ways. It supplies one subroutine, find, which takes a first parameter of either a code or hash reference that configures the details of the search and one or more subsequent parameters defining the starting directory or directories to begin from. A second, finddepth, finds the same files as find but traverses them in order of depth. This can be handy in cases when we want to modify the file system as we go, as we will see later.

If the first parameter to either find or finddepth is a code reference, then it is treated as a wanted subroutine that tests for particular properties in the files found. Otherwise, it is a reference to a hash containing at least a wanted key and code reference value and optionally more of the key-value pairs displayed in Table 13-7.

Table 13-7. File::Find Configuration Fields

Key Value Description
wanted <code ref>

A reference to a callback subroutine that returns true or false depending on the characteristics of the file. Note that passing in a code reference as the first parameter is equivalent to passing

{wanted => $coderef}

Since find does not return any result, a wanted subroutine is required for find to do anything useful. The name is something of a misnomer, as the subroutine does not return a value to indicate whether a given file is wanted.

bydepth 0|1 A Boolean flag that when set causes files to be returned in order of depth. The convenience subroutine finddepth is a shorthand for this flag.
follow 0|1 A Boolean flag that when set causes find to follow symbolic links. When in effect, find records all files scanned in order to prevent files being found more than once (directly and via a link, for example) and to prevent loops (a link linking to its parent directory). For large directory trees, this can be very time consuming. For a faster but less rigorous alternative, use follow_fast. This option is disabled by default.
follow_fast 0|1 A Boolean flag that, like follow, causes find to follow symbolic links. Unlike follow, it does not check for duplicate files, and so is faster. It still checks for loops, however, by tracking all symbolic links. This option is disabled by default.
follow_skip 0|1|2

A three-state flag that determines how find treats symbolic links if either follow or follow_fast is enabled:

A setting of 0 causes find to die if it encounters a duplicate file, link, or directory.

The default of 1 causes any file that is not a directory or symbolic link to be ignored if it is encountered again. A directory encountered a second time causes find to die.

A setting of 2 causes find to ignore both duplicate files and directories.

This flag has no effect if neither follow nor follow_fast is enabled.

no_chdir 0|1 A Boolean flag that when set causes find not to change down into each directory as it scans it. This primarily makes a difference to the wanted subroutine, if any is defined.
untaint 0|1 A Boolean flag that when set causes find to untaint directory names when running in taint (-T) mode. This uses a regular expression to untaint the directory names, which can be overridden with untaint_pattern.
untaint_pattern <pattern>

The pattern used to untaint directory names if untaint is enabled. The default pattern, which attempts to define all standard legal file name characters, is

qr|^([-+@\w./]+)$|

If overridden, the replacement must be a regular expression search pattern compiled with qr. In addition, it must contain one set of parentheses to return the untainted name and should probably be anchored at both ends.

File names containing spaces will fail the taint check unless this pattern is overridden. If multiple parentheses are used, then only the text matched by the first is used as the untainted name.

untaint_skip 0|1 A Boolean flag that when set causes find to skip over directories that fail the test against untaint_pattern. The default is unset, which causes find to die if it encounters an invalid directory name.

The following call to find searches for and prints out all files under /home, following symbolic links, untainting as it goes, and skipping over any directory that fails the taint check. At the same time, it pushes the files it finds onto an array to store the results of the search:

my @files;
find({
    wanted => sub {
        print $File::Find::fullname, "\n";
        push @files, $File::Find::fullname;
    },
    follow => 1, untaint => 1, untaint_skip => 1
}, '/home');

The power of find lies in the wanted subroutine. find does not actually return any value, so without this subroutine the search will be performed but will not produce any useful result. In particular, no list of files is built automatically; we must take steps within the subroutine to store the names of any files we wish to record if we want to refer to them afterwards. While this is simple enough to do, the File::Find::Wanted module from CPAN augments File::Find and fixes this detail by providing a find_wanted subroutine. Used in place of find, it makes the wanted subroutine's Boolean return value meaningful: each file for which the subroutine returns true is added to a list, which find_wanted then returns.

To specify a wanted subroutine, we can pass a code reference to an anonymous subroutine (possibly derived from a named subroutine) either directly or as the value of the wanted key in the configuration hash. Each file that is located is passed to this subroutine, which may perform any actions it likes, including removing or renaming the file. For example, here is a simple utility script that renames all files in the target directory or directories using lowercase format:

#!/usr/bin/perl
# lcall.pl
use warnings;
use strict;

use File::Find;
use File::Copy;

die "Usage: $0 <dir> [<dir>...] " unless @ARGV;
foreach (@ARGV) {
    die "'$_' does not exist " unless -e $_;
}

sub lcfile {
    print "$File::Find::dir - $_\n";
    move ($_, lc $_);
}

finddepth (\&lcfile, @ARGV);

In order to handle subdirectories correctly, we use finddepth so files are renamed first and the directories that contain them second. We also use the move subroutine from File::Copy, since this deals with both files and directories without any special effort on our part.

Within the subroutine, the variable $_ contains the current file name, and the variable $File::Find::dir contains the directory in which the file was found. If follow or follow_fast is in effect, then $File::Find::fullname contains the complete absolute path to the file with all symbolic links resolved to their true paths. If no_chdir has been specified, then $_ is the full pathname of the file, the same as $File::Find::name; otherwise, it is just the leafname of the file.

If follow or follow_fast is set, then the wanted subroutine can make use of the results of the lstat that both these modes use. File tests can then use the special file name _ without any initial file test or explicit lstat. Otherwise, no stat or lstat has been done, and we need to use an explicit file test on $_. As a final example, here is a utility script that searches for broken links:

#!/usr/bin/perl
# checklink.pl
use warnings;
use strict;

use File::Find;

my $count = 0;

sub check_link {
    if (-l && !-e) {
       $count++;
       print "\t$File::Find::name is broken\n";
    }
}

print "Scanning for broken links in ", join(', ', @ARGV), ": ";
find(&check_link, @ARGV);
print "$count broken links found ";

Note that it has to do both an explicit -l and -e to work, since one requires an lstat and the other a stat, and we do not get a free lstat because in this case we are not following symbolic links. (In follow mode, broken links are discarded before the wanted subroutine is called, which would rather defeat the point.)

Another way to create utilities like this is through the find2perl script, which comes as standard with Perl. This emulates the syntax of the traditional Unix find command, but instead of performing a search, it generates a Perl script using File::Find that emulates the action of the original command in Perl. Typically, the script is faster than find, and it is also an excellent way to create the starting point for utilities like the examples in this section. For example, here is find2perl being used to generate a script, called myfind.pl, that searches for and prints all files ending in .bak that are a week or more old, starting from the current directory:

> find2perl . -name '*.bak' -type f  -mtime +7 -print > myfind.pl

We don't need to specify the -print option in Perl 5.8 since it is now on by default, but it doesn't do any harm either. find2perl takes a lot of different options and arguments, including ones not understood by find, to generate scripts that have different outcomes and purposes such as archiving. This command is, however, a fairly typical example of its use. This is the myfind.pl script that it produces:

#! /usr/bin/perl -w
    eval 'exec /usr/bin/perl -S $0 ${1+"$@"}'
        if 0; #$running_under_some_shell

use strict;
use File::Find ();

# Set the variable $File::Find::dont_use_nlink if you're using AFS,
# since AFS cheats.

# for the convenience of &wanted calls, including -eval statements:
use vars qw/*name *dir *prune/;
*name   = *File::Find::name;
*dir    = *File::Find::dir;
*prune  = *File::Find::prune;

# Traverse desired file systems
File::Find::find({wanted => \&wanted}, '.');
exit;

sub wanted {
    my ($dev, $ino, $mode, $nlink, $uid, $gid);

    /^.*\.bak\z/s
    && (($dev, $ino, $mode, $nlink, $uid, $gid) = lstat($_))
    && -f _
    && (int(-M _) > 7)
    && print("$name\n");
}

Often we want to make a record of the files that are of interest. Since the wanted subroutine has no way to pass back values to us, the caller, this means adding files to a global array or hash of some kind. Since globals are undesirable, this is an excellent opportunity to make use of a closure: a subroutine and a my-declared variable nested within a bare block. Here is an example:

#!/usr/bin/perl
# filefindclosure.pl
use strict;
use warnings;
use File::Find;

die "Give me a directory " unless @ARGV;

{ # closure for processing File::Find results
    my @results;

    sub wanted { push @results, $File::Find::name }
    sub findfiles {
        @results=();

        find \&wanted, $_[0];
        return @results;
    }
}

foreach my $dir (@ARGV) {
    print("Error: $dir is not a directory "), next unless -d;
    my @files=findfiles($dir);
    print "$_ contains @files ";
}
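
The File::Find::Wanted module mentioned earlier packages this store-and-return pattern up for us. Assuming the module has been installed from CPAN, the closure above reduces to a sketch like this:

use File::Find::Wanted;

# find_wanted collects every file for which the subroutine returns true
my @files = find_wanted(sub { -f }, @ARGV);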

For more recent versions of Perl, File::Find implements its own warnings category to issue diagnostics about any problems it encounters traversing the filing system, such as broken symbolic links or a failure to change to or open a directory. We might not find these warnings that helpful, so we can disable them (but leave all other warnings enabled) with

use warnings;
no warnings 'File::Find';

Deciphering File Paths

The File::Basename module provides subroutines to portably dissect file names. It contains one principal subroutine, fileparse, which attempts to divide a file name into a leading directory path, a basename, and a suffix:

use File::Basename;

# 'glob' all files with a three character suffix and parse pathname
foreach (</home/*/*.???>) {
    my ($leaf, $path, $suffix) = fileparse($_, '\.\w{3}');
    ...
}

The path and basename are determined according to the file naming conventions of the underlying file system, as determined by the operating system or configured with fileparse_set_fstype. The suffix list, if supplied, provides one or more regular expressions, which are anchored at the end of the file name and tested. The first one that matches is used to separate the suffix from the basename. For example, to find any dot plus three-letter suffix, we can use '\.\w\w\w' or, as in the preceding example, '\.\w{3}'.

To search for a selection of specific suffixes, we can either supply a list or combine all combinations into a single expression. Which we choose depends only on which is more likely to execute faster:

fileparse ($filename, '\.txt', '\.doc');   # list of suffixes
fileparse ($filename, '\.(txt|doc)');      # combined regular expression

fileparse ($filename, '\.htm', '\.html', '\.shtml');   # list of suffixes
fileparse ($filename, '\.s?html?');        # combined regular expression

Remember when supplying suffixes that they are regular expressions. Dots in particular must be escaped if they are intended to mean a real dot (however, see the basename subroutine detailed next for an alternative approach).

In addition to fileparse, File::Basename supplies two specialized subroutines, basename and dirname, which return the leading path and the basename only:

my $path = dirname($filename);
my $leaf = basename($filename, @suffixes);

basename returns the same result as the first item returned by fileparse except that metacharacters in the supplied suffixes (if any) are escaped with \Q...\E before being passed to fileparse. As a result, suffixes are detected and removed from the basename only if they literally match:

# scan for .txt and .doc with 'fileparse'
my ($leaf, $path, $suffix) = fileparse($filename, '\.(txt|doc)');

Or:

# scan for .txt and .doc with 'basename'
my $leaf = basename($filename, '.txt', '.doc');

dirname returns the same result as the second item returned by fileparse (the leading directory) on most platforms. For Unix and MSDOS, however, it will return . if there is no leading directory or a directory is supplied as the argument. This differs from the behavior produced by fileparse:

# scan for leading directory with 'fileparse'
print +(fileparse('directory/file'))[1];   # produces 'directory/'
print +(fileparse('file'))[1];             # produces './'
print +(fileparse('directory/'))[1];       # produces 'directory/'

Or:

# scan for leading directory with 'dirname'
print dirname('directory/file');   # produces 'directory'
print dirname('file');             # produces '.'
print dirname('directory/');       # produces '.'

The file system convention for the pathname can be set to one of several different operating systems with the fileparse_set_fstype configuration subroutine. This can take one of the following case-insensitive values shown in Table 13-8, each corresponding to the appropriate platform.

Table 13-8. File::Basename File System Conventions

Value Platform
AmigaOS Amiga syntax
MacOS Macintosh (OS9 and earlier) syntax
MSWin32 Microsoft Windows long file names syntax
MSDOS Microsoft DOS short file names (8.3) syntax
OS2 OS/2 syntax
RISCOS Acorn RiscOS syntax
VMS VMS syntax

If the syntax is not explicitly set with fileparse_set_fstype, then a default value is deduced from the special variable $^O (or $OSNAME with use English). If $^O is none of the preceding file system types, Unix-style syntax is assumed. Note that if the pathname contains / characters, then the format is presumed to be Unix style whatever the file system type specified.
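
As a brief sketch, this asks fileparse to dissect Windows-style names even when running elsewhere; the file name is hypothetical:

use File::Basename;

fileparse_set_fstype('MSWin32');
my ($leaf, $path, $suffix) = fileparse('C:\reports\summary.doc', '\.doc');
# $leaf should now be 'summary' and $path 'C:\reports\'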

For a more comprehensive approach to portable file name handling, the low-level File::Spec module provides an interface to several different filing system and platform types. It is extensively used by other modules, including File::Basename and the File::Glob modules (and in fact most of the File:: family of modules). We do not usually need to use it directly because these other modules wrap its functionality in more purposeful and friendly ways, but it is useful to know it is there nonetheless. Specific filing system support is provided by submodules like File::Spec::Unix, File::Spec::Win32, and File::Spec::Mac. The correct module is used automatically to suit the platform of execution, but if we want to manage Macintosh file names on a Windows system, accessing the platform-specific module will give us the ability to do so.

Several functions of File::Spec are worth mentioning here, because they relate to the handling of pathnames. The module is an object-oriented one to allow it to be easily used in other file system modules, and so the functions are actually provided as methods, not subroutines—a functional but otherwise identical interface to the available subroutines is offered by the File::Spec::Functions module. None of the methods shown in Table 13-9 actually touch the filing system directly. Instead, they provide answers to questions like "Is this filing system case insensitive?" and "Is this an absolute or relative file name?"

Table 13-9. File::Spec Methods

Method Description
File::Spec->curdir() Return the native name for the current working directory—that is, . on most platforms. For the actual path, we need Cwd.
File::Spec->rootdir() Return the native name for the root directory. On Unix, that's /. On Windows and Mac, it depends on the currently active volume.
File::Spec->devnull() The name of the null device, for reading nothing or dumping output to nowhere. /dev/null on Unix, nul on Windows.
File::Spec->canonpath($path) Clean up the passed path into a canonical form, removing cruft like redundant . or trailing / elements appropriately. It does not remove .. elements—see File::Spec->no_upwards(@files).
File::Spec->updir() Return the native name for the parent directory—that is, .. on most platforms. For the actual path, Cwd and File::Basename are needed.
File::Spec->no_upwards(@files) Examine a list of file names (such as that returned by readdir) and remove entries that refer to the current or parent directory (. and ..; see File::Spec->curdir() and File::Spec->updir()).
File::Spec->case_tolerant() Return true if the platform does not differentiate upper- and lowercase file names (that is, it is case insensitive), false otherwise.
File::Spec->file_name_is_absolute($path) Return true if the given file name is absolute on the current platform.
File::Spec->path() Return the current path, as understood by the underlying shell. This is the PATH environment variable for Unix and Windows, but it varies for other platforms.
File::Spec->rel2abs($path,$to) Return the absolute path given a relative path and, optionally, a base path to attach the relative path to. If not specified, the current working directory is used.
File::Spec->abs2rel($path,$from) The inverse of rel2abs, this takes an absolute path and derives the relative path from the optional base path supplied as the second argument. Again, the current working directory is used if only one argument is supplied.

In addition to these routines, we also have access to catfile, catdir, catpath, join, splitdir, splitpath, and tmpdir. With the exception of tmpdir, these are all involved in the construction or deconstruction of pathnames to and from their constituent parts. The File::Basename and File::Path modules provide a more convenient interface to most of this functionality, so we generally would not need to access the File::Spec methods directly. The tmpdir method returns the location of the system-supplied temporary directory, /tmp on most Unix platforms. It is used by modules that create temporary files, and we discuss it in more detail later on.

To call any of these methods, for example path, we can use either the object-oriented approach:

use File::Spec; # object-oriented
print File::Spec->path();

or use the equivalent functional interface:

use File::Spec::Functions; # functional
print path();

By default, File::Spec::Functions automatically exports canonpath, catdir, catfile, curdir, rootdir, updir, no_upwards, file_name_is_absolute, and path. We can choose to import all functions with :ALL or select individual functions in the usual way.
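
A couple of the construction and deconstruction routines in action; the paths are purely illustrative:

use File::Spec::Functions qw(catfile splitdir);

my $file  = catfile('home', 'gurgeh', 'notes.txt');  # 'home/gurgeh/notes.txt' on Unix
my @parts = splitdir('/usr/local/lib');              # ('', 'usr', 'local', 'lib')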

File Name Globbing

The majority of operating system shells support a wildcard syntax for specifying multiple files. For instance, *.doc means all files ending with .doc. Perl provides this same functionality through the file glob operator glob, which returns a list of all files that match the specified wildcard glob pattern:

my @files = glob '*.pod';   # return all POD documents in current directory

The glob pattern, not to be confused with a regular expression search pattern, accepts any pattern that would normally be accepted by a shell, including directories, wildcard metacharacters such as asterisks (matching zero or more characters), question marks (matching exactly one character), and character classes. The following examples demonstrate the different kinds of glob operation that we can perform:

# match html files in document roots of all virtual hosts
my @html_files = glob '/home/sites/site*/web/*.html';
# match all files in current directory with a three-letter extension
my @three_letter_extensions = glob '*.???';
# match all files beginning with a to z
my @lcfirst = glob '[a-z]*';
# match 'file00' to 'file49'
my @numbered_files = glob 'file[0-4][0-9]';
# match any file with a name of three or more characters
my @three_or_more_letter_files = glob '???*';

The order in which files are returned is by default sorted alphabetically and case sensitively (so uppercase trumps lowercase). We can alter this behavior by passing flags to the File::Glob module, which underlies glob, as well as allow more extended syntaxes than those in the preceding examples.

Before embarking on a closer examination of the glob function, keep in mind that while the underlying platform-specific glob modules do a good job of presenting the same interface and features, the opendir, readdir, and closedir functions are more reliable in cross-platform use, if more painstaking to use. This is particularly important with older versions of Perl (especially prior to version 5.6) where glob is less portable.

glob Syntax

The glob operator can be used with two different syntaxes. One, the glob built-in function, we have already seen:

my @files = glob '*.pl';  # explicit glob

The other is to use angle brackets in the style of the readline operator:

my @files = <*.pl>;       # angle-bracket glob

How does Perl tell whether this is a readline or a glob? When Perl encounters an angle bracket construction, it examines the contents to determine whether it is a syntactically valid filehandle name or not. If it is, the operator is interpreted as a readline. Otherwise, it is handled as a file glob. Which syntax we use is entirely arbitrary. The angle bracket version looks better in loops, but it resembles the readline <> operator, which can create ambiguity for readers of the code:

foreach (<*.txt>) {
    print "$_ is not a textfile!" if !-T;
}

One instance we might want to use glob is when we want to perform a file glob on a pattern contained in a variable. A variable between angle brackets is ambiguous, so at compile time Perl guesses it is a readline operation. We can insert braces to force Perl to interpret the expression as a file glob, but in these cases it is often simpler to use glob instead:

@files = <$filespec>;     # ERROR: attempts to read lines
@files = <${filespec}>;   # ok, but algebraic
@files = glob $filespec;  # better

The return value from the globbing operation is a list containing the names of the files that matched. Files are matched according to the current working directory if a relative pattern is supplied; otherwise, they are matched relative to the root of the file system. The returned file names reflect this too, incorporating the leading directory path if one was supplied:

@files = glob '*.html';   # relative path
@files = glob '/home/httpd/web/*.html';   # absolute path

glob combines well with file test operators and array processing functions like map and grep. For example, to locate all text files in the current directory, we can write

my @textfiles = grep {-f && -T _} glob('*');

The glob function does not recurse, however. To do the same thing over a directory hierarchy, we can use the File::Find module with a wanted subroutine containing something similar:

sub wanted {
    push @textfiles, $File::Find::name if -f && -T _;
}

The glob operator was originally a built-in Perl function, but since version 5.6 it is implemented in terms of the File::Glob module, which implements Unix-style file globbing and overrides the built-in core glob. An alternative module, File::DosGlob, implements Windows/DOS-style globbing, with some extensions.

Unix-Style File Globbing

The standard glob does file globbing in the style of Unix, but it will still work on other platforms. The forward slash is used as a universal directory separator in patterns and will match files on the file system irrespective of the native directory separator. On Windows/DOS systems, the backslash is also accepted as a directory separator.

We automatically trigger use of the File::Glob module whenever we make use of the glob operator in either of its guises, but we can modify and configure the operator more finely by using the module directly. File::Glob defines four import tags that can be imported to provide different features, listed in Table 13-10.

Table 13-10. File::Glob Import Tags

Label Function
:glob Import symbols for the flags of glob's optional flag argument. See Table 13-11 for a list and description of each flag.
:case Treat the file glob pattern as case sensitive. For example, *.doc will match file.doc but not file.DOC.
:nocase Treat the file glob pattern as case insensitive. For example, *.doc will match both file.doc and file.DOC.
:globally

Override the core glob function. From Perl 5.6 this happens automatically. This will also override a previous override, for example, by File::DosGlob.

For example, to import the optional flag symbols and switch the file globbing operator to a case-insensitive mode, we would write

use File::Glob qw(:glob :nocase);

If not explicitly defined, the case sensitivity of glob is determined by the underlying platform (as expressed by the special variable $^O). The :case and :nocase labels allow us to override this default. For individual uses, temporary case sensitivity can be controlled by passing a flag to the glob operator instead, as we will see next.

Extended File Globbing

The glob operator accepts a number of optional flags that modify its behavior. These flags are given as a second parameter to glob and may be bitwise ORed together to produce multiple effects. To import a set of constants to name the flags, use File::Glob, explicitly specifying the :glob label:

use File::Glob qw(:glob);

The core glob function is prototyped to take a single argument, and this is still enforced even though it is now implemented in terms of a two-argument subroutine. To supply flags, we call the glob subroutine in the File::Glob package directly, where the prototype does not apply. For example, to enable brace expansion and match case insensitively, we would use

my @files = File::Glob::glob $filespec, GLOB_BRACE|GLOB_NOCASE;

The full list of flags is displayed in Table 13-11.

Table 13-11. File::Glob Operator Flags

Flag Function
GLOB_ALPHASORT Along with GLOB_NOSORT, this flag alters the order in which files are returned. By default, files are returned in case-sensitive alphabetical order. GLOB_ALPHASORT causes an alphabetical sort, but case-insensitively, so upper- and lowercase file names are adjacent to each other. See also GLOB_NOSORT.
GLOB_BRACE

Expand curly braces. A list of alternatives separated by commas is placed between curly braces. Each alternative is then expanded and combined with the rest of the pattern. For example, to match any file with an extension of .exe, .bat, or .dll, we could use

my @files = glob '*.{exe,bat,dll}';

Likewise, to match Perl-like files:

my @perl_files = glob '*.{pm,pl,ph,pod}';

See also "DOS-Style File Globbing" later for an alternative approach.

GLOB_CSH

File globbing in the style of the Unix C Shell csh. This is a combination of all four of the FreeBSD glob extensions for convenience:

GLOB_BRACE|GLOB_NOMAGIC|GLOB_QUOTE|GLOB_TILDE
GLOB_ERR Cause glob to return an error if it encounters an error such as a directory that it cannot open. Ordinarily glob will pass over errors. See "Handling Globbing Errors" for details.
GLOB_LIMIT Cause glob to return a GLOB_NOSPACE error if the size of the expanded glob exceeds a predefined system limit, typically defined as the maximum possible command-line argument size.
GLOB_MARK Return matching directories with a trailing directory separator /.
GLOB_NOCASE Perform case-insensitive matching. The default is to assume matches are case sensitive, unless glob detects that the underlying platform does not handle case-sensitive file names, as discussed earlier. Note that the :case and :nocase import labels override the platform-specific default, and GLOB_NOCASE then applies on a per-glob basis.
GLOB_NOCHECK Return the glob pattern if no file matches it. If GLOB_QUOTE is also set, the returned pattern is processed according to the rules of that flag. See also GLOB_NOMAGIC.
GLOB_NOMAGIC As GLOB_NOCHECK, but the pattern is returned only if it does not contain any of the wildcard characters *, ?, or [.
GLOB_NOSORT Disable sorting altogether and return files in the order in which they were found for speed. It overrides GLOB_ALPHASORT if both are specified. See also GLOB_ALPHASORT.
GLOB_QUOTE Treat backslashes, \, as escape characters and interpret the following character literally, ignoring any special meaning it might normally have. On DOS/Windows systems, backslash only escapes metacharacters and is treated as a directory separator otherwise. See also "DOS-Style File Globbing" later for an alternative approach.
GLOB_TILDE Expand the leading tilde, ˜, of a pattern to the user home directory. For example, ˜/.myapp/config might be expanded to /home/gurgeh/.myapp/config.

Handling Globbing Errors

If glob encounters an error, it puts an error message in $! and sets the package variable File::Glob::GLOB_ERROR to a non-zero value with a symbolic name defined by the module:

GLOB_NOSPACE Perl ran out of memory.
GLOB_ABEND Perl aborted due to an error.

If the error occurs midway through the scan, and some files have already been found, then the incomplete glob is returned as the result. This means that getting a result from glob does not necessarily mean that the file glob completed successfully. In cases where this matters, check $File::Glob::GLOB_ERROR:

@files = glob $filespec;
if ($File::Glob::GLOB_ERROR) {
    die "Error globbing '$filespec': $! ";
}

DOS-Style File Globbing

DOS-style file globbing is provided by the File::DosGlob module, an alternative to File::Glob that implements file globs in the style of Windows/DOS, with extensions. In order to get DOS-style globbing, we must use this module explicitly, to override the Unix-style globbing that Perl performs automatically (for instance, if we are running on a Windows system, we may receive wildcard input from the user that conforms to DOS rather than Unix style):

use File::DosGlob;            # provide File::DosGlob::glob
use File::DosGlob qw(glob);   # override core/File::Glob's 'glob'

Unlike File::Glob, File::DosGlob does not allow us to configure aspects of its operation by specifying labels to the import list, and it does not even override the core glob unless explicitly asked, as shown in the second example earlier. Even if we do not override glob, we can call the File::DosGlob version by naming it in full:

@dosfiles = File::DosGlob::glob ($dosfilespec);

Even with glob specified in the import list, File::DosGlob will only override glob in the current package. To override it everywhere, we can use GLOBAL_glob:

use File::DosGlob qw(GLOBAL_glob);

This should be used with extreme caution, however, since it might upset code in other modules that expects glob to work in the Unix style.

Unlike the DOS shell, File::DosGlob works with wildcarded directory names, so a file spec of C:/*/dir*/file* will work correctly (although it might take some time to complete).

The module also understands DOS-style backslashes as directory separators, although these may need to be protected, either with single quotes:

my @dosfiles = glob('my\dos\filepath\*.txt');   # single quoted

or by escaping them:

my @dosfiles = <my\\dos\\filepath\\*.txt>;      # escaped

Any mixture of forward and backslashes is acceptable to File::DosGlob's glob (and indeed Perl's built-in one, on Windows); translation into the correct pattern is done transparently and automatically:

my @dosfiles = <my/dos/filepath\*.txt>;       # a mixture

To search in file names or directories that include spaces, we can escape them using a backslash (which means that we must interpolate the string and therefore protect literal backslashes):

my @programfiles = <C:/Program\ Files/*.*>;

If we use the glob literally, we can also use double quotes if the string is enclosed in single quotes (or the q quoting operator):

my @programfiles = glob 'C:/"Program Files"/*.*';

This functionality is actually implemented via the Text::ParseWords module, covered in Chapter 19.

Finally, multiple glob patterns may be specified in the same pattern if they are separated by spaces. For example, to search for all .exe and .bat files, we could use

my @executables = glob('*.exe *.bat');

Temporary Files

There have always been two basic approaches for creating temporary files in Perl, depending on whether we just want a scratchpad that we can read and write or want to create a temporary file with a file name that we can pass around. To do the first, we can create a filehandle with IO::File that points to a temporary file that exists only so long as the filehandle is open. To do the second, we can deduce the name of a unique temporary file and then open and close it like an ordinary file, using the POSIX tmpnam function.

From Perl 5.6.1, we have a third approach that involves using File::Temp, which returns both a file name and a filehandle. From Perl 5.8, we have a fourth, an anonymous temporary file that we can create by passing a file name of undef to the built-in open function. This is essentially the same as the first approach, but using a new native syntax. We covered anonymous temporary files in the last chapter, so here we will examine the other three approaches.

Creating a Temporary Filehandle

Temporary filehandles can be created with the new_tmpfile method of the IO::File module. new_tmpfile takes no arguments and opens a new temporary file in read-update (and binary, for systems that care) mode, returning the generated filehandle. In the event of an error, undef is returned and $! is set to indicate the reason. For example:

my $tmphandle = IO::File->new_tmpfile();
unless ($tmphandle) {
   print "Could not create temporary filehandle: $! ";
}

Wherever possible, the new_tmpfile method accesses the operating system tmpfile library call (on systems that provide it). This makes the file truly anonymous and is the same interface provided by open in sufficiently modern versions of Perl. On these generally Unix-like systems, a file exists as long as something is using it, even if it no longer has a file name entered in the file system. new_tmpfile makes use of this fact to remove the file system entry for the file as soon as the filehandle is created, making the temporary file truly anonymous. When the filehandle is closed, the file ceases to exist, since there will no longer be any references to it. This behavior is not supported on platforms that do not support anonymous temporary files, but IO::File will still create a temporary file for us. See Chapter 12 for more information on filehandles and temporary anonymous files.
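
As a quick sketch, the handle can serve as a scratchpad that we write, rewind, and read back, all through the one filehandle:

use IO::File;

my $tmp = IO::File->new_tmpfile or die "Cannot create temporary file: $!\n";
print $tmp "intermediate results\n";
$tmp->seek(0, 0);   # rewind to the start (0 is SEEK_SET)
my @lines = <$tmp>;
$tmp->close;        # the file ceases to exist here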

Temporary File Names via the POSIX Module

While IO::File's new_tmpfile is very convenient for a wide range of temporary file applications, it does not return us a file name that we can use or pass to other programs. To do that, we need to use the POSIX module and the tmpnam routine. Since POSIX is a large module, we can import just tmpnam with

use POSIX qw(tmpnam);

The tmpnam routine takes no arguments and returns a temporary file name guaranteed to be unique at the moment of inquiry. For example:

my $tmpname = tmpnam();
print $tmpname;   # produces something like '/tmp/fileV9vJXperl'

File names are created with a fixed and unchangeable default path, defined by the P_tmpdir value given in the C standard library's stdio.h header file. We can change the path afterwards, but doing so does not guarantee that the file does not already exist in the new directory. To ensure that, we might resort to a loop like this:

my $tmpname;
do {
    $tmpname = tmpnam();
    $tmpname =~ m|/([^/]+)$| and $tmpname = $1;   # strip the directory part
    $tmpname = $newpath.$tmpname;                 # add new path
} while (-e $tmpname);

This rather defeats the point of tmpnam, however, which is to create a temporary file name quickly and easily in a place that is suitable for temporary files (/tmp on any vaguely Unix-like system). It also does not handle the possibility that other processes might be trying to create temporary files in the same place. This is a significant possibility and a potential source of race conditions. Two processes may call tmpnam at the same time, get the same file name in return, then both open it. To avoid this, we open the temporary file using sysopen and specify the O_EXCL flag, which requires that the file does not yet exist. Here is a short loop that demonstrates a safe way to open the file:

# get an open (and unique) temporary file
my $tmpname;
do {
    $tmpname = tmpnam();
    sysopen TMPFILE, $tmpname, O_RDWR|O_CREAT|O_EXCL;
} until (defined fileno(TMPFILE));

If another process creates the same file in between our call to tmpnam and the sysopen, the O_EXCL will cause it to fail; TMPFILE will not be open, and so the loop repeats (see the next section for a better approach). Note that if we only intend to write to the file, O_WRONLY would do just as well, but remember to import the symbols from the POSIX or Fcntl modules. Once we have the file open, we can use it:

# place data into the file
print TMPFILE "This is only temporary ";
close TMPFILE;

# use the file - read it, write it some more, pass the file name to another
# process, etc.

# remember to tidy up afterwards!
unlink $tmpname;

Since we have an actual tangible file name, we can pass it to other processes. This is a common approach when reading the output of another command created with a piped open. For example, here is an anonymous FTP command-line client, which we can use to execute commands on a remote FTP server:

#!/usr/bin/perl -w
# ftpclient.pl
use warnings;
use strict;
use POSIX qw(O_RDWR O_CREAT O_EXCL tmpnam);
use Sys::Hostname; # for 'hostname'

die "Simple anonymous FTP command line client ".
    "Usage: $0 <server> <command> " unless scalar(@ARGV)>=2;

my ($ftp_server,@ftp_command)=@ARGV;

# get an open and unique temporary file
my $ftp_resultfile;
do {
    # generate a new temporary file name
    $ftp_resultfile = tmpnam();

    # O_EXCL ensures no other process successfully opens the same file
    sysopen FTP_RESULT, $ftp_resultfile, O_RDWR|O_CREAT|O_EXCL;
    # failure means something else opened this file name first, try again
} until (defined fileno(FTP_RESULT));

# run ftp client with autologin disabled (using -n)
if (open (FTP, "|ftp -n > $ftp_resultfile 2>&1")) {
    print "Client running, sending command ";

    # command: open connection to server
    print FTP "open $ftp_server ";
    # command: specify anonymous user and email as password
    my $email=getlogin.'@'.hostname;
    print FTP "user anonymous $email ";
    # command: send command (interpolate list to space arguments)
    print FTP "@ftp_command ";

    close FTP;
} else {
    die "Failed to run client: $! ";
}

print "Command sent, waiting for response ";
my @ftp_results = <FTP_RESULT>;
check_result(@ftp_results);
close FTP_RESULT;
unlink $ftp_resultfile;
print "Done ";

sub check_result {
     return unless @_;

     print "Response:\n";
     # just print out the response for this example
     print "\t$_" foreach @_;
}

We can use this (admittedly simplistic) client like this:

$ ftpclient.pl ftp.alphacomplex.com get briefing.doc

Using File::Temp

As of Perl 5.6.1, we have a better approach to creating temporary files, using the File::Temp module. This module returns the name and filehandle of a temporary file together. This eliminates the possibility of a race condition. Instead of using sysopen with the O_EXCL flag, as we showed in the previous section, File::Temp provides us with the following much simpler syntax using its tempfile function:

my ($FILEHANDLE, $filename) = tempfile();

However, tempfile can take arguments that we can use to gain more control over the created temporary file, as shown in the following:

my ($FILEHANDLE, $filename) = tempfile($template, DIR => $dir, SUFFIX => $suffix);

The template should contain at least four trailing Xs, which would then be replaced with random letters, so $template could be something like filenameXXXXX. By specifying an explicit directory with DIR, we can specify the directory where we want the temporary file to be created. Otherwise, the file will be created in the directory specified for temporary files by the function tmpdir in File::Spec.

Finally, at times we might need our temporary file to have a particular suffix, possibly for subsequent processing by other applications. The following will create a temporary file called fileXXXX.tmp (where the four Xs are replaced with four random letters) in the directory /test/files:

my ($FILEHANDLE, $filename) = tempfile("fileXXXX", DIR => "/test/files",
                                                   SUFFIX => ".tmp");

However, the recommended interface is to call tempfile in scalar instead of list context, returning only the filehandle:

my $FILEHANDLE = tempfile("fileXXXX", DIR => "/test/files", SUFFIX => ".tmp");

The file itself will be automatically deleted when closed. No way to tamper with the file name means no possibility of creating a race condition.

To create temporary directories, File::Temp provides us with the tempdir function. Using the function without argument creates a temporary directory in the directory set by tmpdir in File::Spec:

my $tempdir = tempdir();

As with tempfile, we can specify a template and explicit directory as arguments to tempdir. Here also the template should have at least four trailing Xs that will be translated into four random letters. The DIR option overrides the value of File::Spec's tmpdir:

my $tempdir = tempdir("dirXXXX", DIR => "/test/directory");

This will create a temporary directory called something like /test/directory/dirdnar, where dnar are four random letters that replaced the four Xs. If the template included parent directory specifications, then they are removed before the directory is prepended to the template. In the absence of a template, the directory name is generated from an internal template.

Removing the temporary directory and all its files, whether created by File::Temp or not, can be achieved using the option CLEANUP => 1.
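
For example, this sketch creates a scratch area under /tmp that is swept away, contents and all, when the program finishes:

use File::Temp qw(tempdir);

my $workdir = tempdir('buildXXXX', DIR => '/tmp', CLEANUP => 1);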

In addition to the functions tempfile and tempdir, File::Temp provides Perl implementations of the mktemp family of temp file generation system calls. These are shown in Table 13-12.

Table 13-12. File::Temp Functions

Function Description
mkstemp

Using the provided template, this function returns the name of the temporary file and a filehandle to it:

my ($HANDLE, $name) = mkstemp($template);

If we are interested only in the filehandle, then we can use mkstemp in scalar context.

mkstemps

This is similar to mkstemp but accepts the additional option of a suffix that is appended to the template:

my ($HANDLE, $name) = mkstemps($template, $suffix);
mktemp

This function returns a temporary file name, but it neither creates nor opens the file, so there is no guarantee that another process will not create the file first:

my $unopened = mktemp($template);
mkdtemp

This function uses the given template to create a temporary directory. The name of the directory is returned upon success and undef otherwise:

my $dir = mkdtemp($template);

Finally, the File::Temp module provides implementations of the POSIX tmpnam and tmpfile functions. As mentioned earlier, POSIX uses the value of P_tmpdir in the C standard library's stdio.h header file as the directory for the temporary file. File::Temp, on the other hand, uses the setting of tmpdir. In list context, tmpnam calls mkstemp with an appropriate template and returns a filehandle to the open file and a file name:

my ($HANDLE, $name) = tmpnam();

In scalar context, tmpnam uses mktemp and returns the full name of the temporary file:

my $name = tmpnam();

While this ensures that the file does not already exist, it does not guarantee that this will remain the case. In order to avoid a possible race condition, we should use tmpnam in list context.

The File::Temp implementation of POSIX's tmpfile returns the filehandle of a temporary file. There is no access to the file name, and the file is removed when the filehandle is closed or when the program exits:

my $HANDLE = tmpfile();
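
As a brief sketch of tmpfile in use (the data is arbitrary), we can treat the returned handle exactly like any other read-write filehandle:

use File::Temp qw(:POSIX);   # imports tmpnam and tmpfile

my $TMP = tmpfile();
print $TMP "transient data\n";
seek $TMP, 0, 0;             # rewind to reread what we wrote
my $data = <$TMP>;
close $TMP;                  # the anonymous file now disappears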

For further information on File::Temp, consult the documentation.

Querying and Manipulating Directories

Directories are similar to files in many ways; they have names, permissions, and (on platforms that support it) owners. They are significantly different in other ways, however. At their most basic, files can generally be considered to be content, that is, data. Directories, on the other hand, are indices of metadata. They are record based: each entry in a directory describes a file, directory, link, or special file that the directory contains. It only makes sense to read a directory in terms of records, and it makes no sense at all to write to the directory index directly; the operating system handles that when we manipulate the contents.

Accordingly, operating systems support a selection of functions specifically oriented to handling directories in a record-oriented context, which Perl wraps and makes available to us as a collection of built-in functions with (reasonably) platform-independent semantics. They provide a more portable but lower-level alternative to the glob function discussed earlier in the chapter.

Directories can also be created and destroyed. Perl supports these operations through the functions mkdir and rmdir, which will be familiar by name to anyone with either a Windows or a Unix background. For more advanced applications, the File::Path module provides enhanced directory-spanning analogues for these functions.

A discussion of directories is not complete without the concept of the current working directory. All of Perl's built-in functions that take a file name as an argument, from open to the unary file test operators, base their arguments relative to the current working directory whenever the given file name is not absolute. We can both detect and change the current working directory either using Perl's built-in functions or with the more flexible Cwd module.

Reading Directories

Although directories cannot be opened and read like ordinary files, the equivalent is possible using directory handles. For each of the file-based functions open, close, read, seek, tell, and rewind, there is an equivalent that performs the same function for directories. For example, opendir opens a directory and returns a directory handle:

opendir DIRHANDLE, $dirname;

Although similar to filehandles in many respects, directory handles are an entirely separate subspecies; they only work with their own set of built-in functions and even occupy their own internal namespace within a typeglob, so we can quite legally have a filehandle and a directory handle with the same name. Having said that, creating a filehandle and a directory handle with the same name is more than a little confusing.

If opendir fails for any reason (the obvious ones being that the directory does not exist or is in fact a file), it returns undef and sets $! to indicate the reason. Otherwise, we can read the items in the directory using readdir:

if (opendir DIRHANDLE, $dirname) {
    print "$dirname contains: $_ " foreach readdir DIRHANDLE;
}

readdir is similar in spirit to the readline operator, although we cannot use an equivalent of the <> syntax to read from a directory filehandle. If we do, Perl thinks we are trying to read from a filehandle with the same name. However, like the readline operator, readdir can be called in either a scalar context, where it returns the next item in the directory, or in a list context, where it returns all remaining entries:

my $diritem  = readdir DIRHANDLE;  # read next item
my @diritems = readdir DIRHANDLE;  # read all (remaining) items

(Another example of list context is the foreach in the previous example.)

Rather than return a line from a file, readdir returns a file name from the directory. We can then go on to test the file name with the file test operators or stat/lstat to find out more about it. However, if we do this, we should take care to prepend the directory name first or use chdir; otherwise, the file test will take place not where we found the file but in the current working directory:

opendir DIRHANDLE, '..';   # open parent directory
foreach (readdir DIRHANDLE) {
    print "$_ is a directory   " if -d "../$_";
}
closedir DIRHANDLE;

Or, using chdir:

opendir DIRHANDLE, '..';   # open parent directory
chdir '..';   # change to parent directory
foreach (readdir DIRHANDLE) {
    print "$_ is a directory " if -d;   # use $_
}
closedir DIRHANDLE;

Note that when we are finished with a directory handle, we should close it, again using a specialized version of close, closedir. In the event closedir fails, it returns undef and sets $! to indicate the error. Otherwise, it returns true.
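
Putting these calls together with error checking, a minimal directory lister might look like the following sketch, which also filters out the . and .. entries that readdir returns on most platforms:

#!/usr/bin/perl
# listdir.pl
use strict;
use warnings;

my $dirname = shift @ARGV or die "Usage: $0 <directory>\n";

opendir DIRHANDLE, $dirname or die "Cannot open $dirname: $!\n";
# skip the '.' and '..' entries, keep everything else
my @entries = grep { $_ ne '.' and $_ ne '..' } readdir DIRHANDLE;
closedir DIRHANDLE or warn "Error closing $dirname: $!\n";

print "$dirname contains $_\n" foreach @entries;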

Directory Positions

Directory filehandles also have positions, which can be manipulated with the functions seekdir, telldir, and rewinddir, direct directory analogues of the file position functions seek, tell, and rewind. Keep in mind that the former set of functions only works on directory handles (the plain file counterparts also work on directories, but not very usefully), and any position we pass to seekdir must first have been obtained from telldir, since that is the only way to know which positions correspond to the start of directory entries:

# find current position of directory handle
my $dpos = telldir DIRHANDLE;
# read an item, moving the position forward
my $item = readdir DIRHANDLE;
# reset position back to position read earlier

seekdir DIRHANDLE, $dpos;
# reset position back to start of directory
rewinddir DIRHANDLE;

Although they are analogous, these functions are not as similar to their file-based counterparts as their names might imply. In particular, seekdir is not nearly as smart as seek, because it does not accept an arbitrary position. Instead, seekdir is only good for setting the position to 0, or a position previously found with telldir.

Directory Handle Objects

As an alternative to the standard directory handling functions, we can instead use the IO::Dir module. IO::Dir inherits basic functionality from IO::File, then overloads and replaces the file-specific features with equivalent methods for directories.

my $dirh = new IO::Dir($directory);

Each of the standard directory handling functions is supported by a similarly named method in IO::Dir, minus the trailing dir. Instead of using opendir, we can create a new, unassociated IO::Dir object and then use open:

my $dirh = new IO::Dir;
$dirh->open($directory);

Likewise, we can use read to read from a directory filehandle, seek, tell, and rewind to move around inside the directory, and close to close it again:

my $entry = $dirh->read;    # read an entry
my $dpos = $dirh->tell;     # find current position
$dirh->seek($dpos);         # set position
$dirh->rewind;              # rewind to start
my @entries = $dirh->read;  # read all entries

Directories As Tied Hashes

As an alternative to the object-oriented interface, IO::Dir also supports a tied hash interface, where the directory is represented by a hash and the items in it as the keys of the hash. The values of the hash are lstat objects from the File::stat package, generated by calling lstat on the file named by the key in question. These are created only at the moment we ask for them, so as not to burden the system with unnecessary lstat calls. If the main purpose of interrogating the directory is to perform stat-type operations (including file tests), we can save time by using this interface:

# list permissions of all files in current directory
my %directory;
tie %directory, IO::Dir, '.';

foreach (sort keys %directory) {
    printf ("$_ has permissions %o ", $directory{$_}->mode & 0777);
}
untie %directory;

IO::Dir makes use of the tied hash interface to extend its functionality in other ways too. Assigning an integer as the value of an existing key in the hash will cause the file's access and modification times to be changed to that value. Assigning a reference to an array of two integers will set the access and modification times to the first and second values, respectively. If, on the other hand, the entry does not exist, then an empty file of the same name is created in the directory, again with the appropriate timestamps:

# set all timestamps to the current time:
my $now = time;

foreach (keys %directory) {
    $directory{$_} = $now;
}

# create a new file, accessed now, modified one day ago:
$directory{'newfile'} = [$now, $now - 24*60*60];

Deleting a key-value pair will also delete a file, but only if the option DIR_UNLINK is passed to the tie as a fourth parameter:

# delete backup files ending in .bak or ~
tie %directory, IO::Dir, $dirname, DIR_UNLINK;

foreach (keys %directory) {
   delete $directory{$_} if /(\.bak|~)$/;
}
untie %directory;

With DIR_UNLINK specified, deleting an entry from the hash will either call unlink or rmdir on the items in question, depending on whether it is a file or a directory. In the event of failure, the return value is undef and $! is set to indicate the error, as usual.

Finding the Name of a Directory or File from Its Handle

As a practical example of using the directory functions, the following example is a solution to the problem of finding out the name of a directory or file starting from a handle, assuming we know the name of the parent directory:

sub find_name {
    my ($handle, $parentdir) = @_;

    # find device and inode of the file or directory the handle refers to
    my ($dev, $ino) = lstat $handle;
    opendir PARENT, $parentdir or return;
    foreach (readdir PARENT) {
        # find device and inode of this parent directory entry
        my ($pdev, $pino) = lstat "$parentdir/$_";
        # if it is a match, we have our man
        closedir PARENT, return $_ if ($pdev == $dev && $pino == $ino);
    }
    closedir PARENT;
    return;   # didn't find it...strange!
}

my $name = find_name (*HANDLE, "/parent/directory");
close HANDLE;

First, we use lstat to determine the device and inode of the file or directory the handle refers to. We then open the parent directory and scan each entry in turn, using lstat to retrieve its device and inode; we use lstat rather than stat so that a symbolic link in the parent directory is not confused with the entry it points to. If we find a match, we must be talking about the same entry, so the name of this entry must be the name of the file or directory (or a name, on Unix-like platforms where multiple names can exist for the same file).

We can adapt this general technique to cover whole file systems using the File::Find module, though if we plan to do this a lot, caching the results of previous lstat commands will greatly improve the run time of subsequent searches.
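
As a rough sketch of that adaptation (the subroutine name is our own invention), we can let File::Find walk the tree and perform the same device and inode comparison at each entry:

use File::Find;

sub find_name_anywhere {
    my ($handle, $topdir) = @_;

    # device and inode of the file or directory the handle refers to
    my ($dev, $ino) = lstat $handle;

    my $found;
    find(sub {
        return if defined $found;   # already matched, skip the rest
        my ($pdev, $pino) = lstat $_;
        $found = $File::Find::name
            if defined $pdev and $pdev == $dev and $pino == $ino;
    }, $topdir);
    return $found;
}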

Creating and Destroying Directories

The simplest way to create and destroy directories is to use the mkdir and rmdir functions. These both create or destroy a single directory, starting at the current working directory if the supplied name is relative. For more advanced applications, we can use the File::Path module, which allows us to create and destroy multiple nested directories.

Creating Single Directories

To create a new directory, we use the built-in mkdir function. This takes a directory name as an argument and attempts to create a directory with that name. The pathname given to mkdir may contain parent directories, in which case they must exist for the directory named as the last part of the pathname to be created. If the name is absolute, it is created relative to the root of the filing system. If it is relative, it is created relative to the current working directory:

# relative - create directory 'scripts' in current working directory
mkdir 'scripts';
# absolute - create 'web' in /home/httpd/sites/$site, which must already exist
mkdir "/home/httpd/sites/$site/web";
# relative - create directory 'scripts' in subdirectory 'lib' in current
# working directory. POSSIBLE ERROR: 'lib' must already exist to succeed.
mkdir 'lib/scripts';

mkdir may be given an optional second parameter consisting of a numeric permissions mask, as described earlier in the chapter. This is generally given as an octal number specifying the read, write, and execute permissions for each of the user, group, and other categories. For example, to create a directory with 755 permissions, we would use

mkdir $dirname, 0755;

We can also use the mode symbols from the Fcntl module if we import them first. Here is an example of creating a directory with 0775 permissions, using the appropriate Fcntl symbols:

use Fcntl qw(:mode);
# $dirname with 0775 permissions
mkdir $dirname, S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH;

The second parameter to mkdir is a permissions mask, not a generic file mode. It applies to permissions only, not the other specialized mode bits such as the sticky, setuid, or setgid bits. To set these (on platforms that support them), we must use chmod after creating the directory.
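
For instance, to create a shared directory and then turn on its setgid bit, we might write something like this (a sketch; the directory name is illustrative):

# create the directory, then add the setgid bit separately
mkdir '/data/shared', 0775 or die "mkdir failed: $!\n";
chmod 02775, '/data/shared' or die "chmod failed: $!\n";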

The process's umask setting may also remove bits; it is combined with the permissions mask parameter to determine the actual permissions of the created directory. The default permissions mask is 0777, modified by the umask setting. (A umask setting of octal 022 would reduce the stated permissions of a created directory from 0777 to 0755, for example.) Relying on this default is generally better than specifying a more restrictive permissions mask in the program, as it allows permissions policy to be controlled by the user.
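
A quick sketch of the interaction:

umask 022;               # mask out write permission for group and other
mkdir 'results', 0777;   # actual permissions: 0777 & ~022 = 0755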

Creating Multiple Directories

The mkdir function will only create one directory at a time. To create multiple nested directories, we can use the File::Path module instead.

File::Path provides two routines, mkpath and rmtree. mkpath takes a path specification containing one or more directory names separated by a forward slash, a Boolean flag to enable or disable a report of created directories, and a permissions mask in the style of mkdir. It is essentially an improved mkdir, with none of the drawbacks of the simpler function. For example, to create a given directory path:

use File::Path;

# create path, reporting all created directories
my $verbose = 1;
my $mask = 0755;
mkpath ('/home/httpd/sites/mysite/web/data/', $verbose, $mask);

One major advantage mkpath has over mkdir is that it takes preexisting directories in stride, using them if present and creating new directories otherwise. It also handles the directory naming conventions of VMS and OS/2 automatically. In other respects, it is like mkdir, using the same permissions mask and creating directories from the current working directory if given a relative pathname:

# silently create scripts in lib, creating lib first if it does not exist.
mkpath "lib/scripts";

If mkpath is only given one parameter, as in the preceding example, the verbose flag defaults to 0, resulting in a silent mkpath. And like mkdir, the permissions mask defaults to 0777.

mkpath can also create multiple chains of directories if its first argument is a list reference rather than a simple scalar. For instance, to create a whole installation tree for a fictional application, we could use something like this:

mkpath ([
   '/usr/local/apps/myapp/bin',
   '/usr/local/apps/myapp/doc',
   '/usr/local/apps/myapp/lib',
], 1, 0755);

In the event of an error, mkpath will croak, with $! set to the reason for the failed mkdir. To trap a possible croak, put the mkpath call inside an eval:

unless (defined eval {mkpath(\@paths, 0, 0755)}) {
   print "Error from mkpath: $@ ($!)\n";
}

Otherwise, mkpath returns the list of all directories created. If a directory already exists, then it is not added to this list. As any return from mkpath indicates that the call was successful overall, an empty list means simply that all the directories requested already exist. Since we often do not care if directories were created or not, just so long as they exist, we usually do not actually check the return value, only trap the error as in the preceding example.
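
For example, with hypothetical paths, we can report just the directories that mkpath actually had to create:

use File::Path;

my @created = mkpath(['/tmp/myapp/bin', '/tmp/myapp/lib'], 0, 0755);
print "created: $_\n" foreach @created;   # only newly made directories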

Destroying Single Directories

To delete a directory, we use the rmdir function, which returns 1 on success and 0 otherwise, setting $! to indicate the reason for the error. rmdir takes a single directory name as an argument or uses the value of $_ if no argument is given:

rmdir $dirname;   # remove dirname

rmdir;   # delete directory named by $_

rmdir typically fails if the given name is not a valid pathname or does not point to a directory (it might be a file or a symbolic link to a directory). It will also fail if the directory is not empty.

Deleting nested directories and directories with contents is more problematic. If we happen to be on a Unix system, logged in as superuser, and if we specified the -U option to Perl when we started our application, then we can use unlink to remove the directory regardless of its contents. In general, however, the only recourse we have is to traverse the directory using opendir, removing files and traversing into subdirectories as we go. Fortunately, we do not have to code this ourselves, as there are a couple of modules that will greatly simplify the process.

Destroying Multiple or Nonempty Directories

As well as mkpath, the File::Path module provides a second routine, rmtree, that performs (loosely speaking) the opposite function.

rmtree takes three parameters: the first, like mkpath, is a single scalar directory path. It comprises one or more directories separated by forward slashes, or alternatively a reference to an anonymous array of scalar directory paths. Paths may be either absolute or relative to the current working directory.

The second is, just like mkpath, a Boolean verbosity flag, set to false by default. If enabled, rmtree reports on each file or directory it encounters, indicating whether it used unlink or rmdir to remove it, or whether it skipped over it. Symbolic links are deleted but not followed.

The third parameter is a safety flag, also Boolean and false by default. If true, rmtree will skip over any file for which the program does not have write permission (or more technically, the program's effective user ID does not have write permission), except for VMS, which has the concept of "delete permission." Otherwise, it will attempt to delete it anyway, which depends not on the file's permissions or owner but on the permissions of the parent directory, like rmdir.

Consider the following short script, which simply wraps rmtree:

#!/usr/bin/perl
# rmtree.pl
use strict;
use warnings;
use File::Path;

my $path = $ARGV[0];

my $verbose = 0;
my $safe = 1;
rmtree $path, $verbose, $safe;

With an array reference instead of a scalar pathname, all the paths in the array are deleted. We can remove $path from the preceding script and replace all of the script below it with

# remove all paths supplied, silently and safely.
rmtree(\@ARGV, 0, 1);

On success, rmtree returns the number of files deleted. On a fatal error, it will croak like mkpath and can be trapped in the same way. Other, nonfatal, errors are carped (via the Carp module) and must be trapped by a warning signal handler:

$SIG{__WARN__} = \&handle_warnings;
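
Continuing the preceding script, a minimal handler might simply collect the warnings for later inspection; here is one hedged way to do it (the array name is our own):

my @problems;
$SIG{__WARN__} = sub { push @problems, $_[0] };
rmtree($path, 0, 1);
print "rmtree warned: $_" foreach @problems;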

If the safety flag is not set, rmtree attempts to force the permissions of files and directories to make them deletable. If it then fails to delete them, it may also be unable to restore the original permissions, leaving them potentially insecure. In all such cases, the problem is reported via carp and can be trapped by the warning signal handler if present.

Finding and Changing the Current Directory

All of Perl's directory handling functions from opendir to rmdir understand both absolute and relative pathnames. Relative is in relation to the current working directory, which initially is the directory that the shell was in when it started our application. Desktop icons provided by Windows shortcuts, for example, let us specify the working directory before running the program the shortcut points to. Perl programs started from other processes inherit the current working directory, or CWD for short, of the parent. In a command shell, the cd command changes the current working directory.

We can change the current working directory in Perl with the chdir function. chdir takes a directory path as its argument and attempts to change the current working directory accordingly. If the path is absolute, it is taken relative to the root directory; otherwise, it is taken relative to the current working directory. It returns true on success and false on failure. For example:

unless (chdir $newdir) {
    print "Failed to change to $newdir: $!\n";
}

Without an argument, chdir changes to the home directory, equivalent to entering "cd" on its own on the command line. An argument of undef also behaves this way, but this is now deprecated behavior since it is too easy to accidentally feed chdir an undefined value through an unset variable that was meant to hold a file name.

On Windows things are a bit more complicated, since Windows preserves a current directory for each drive available to the system. The current directory as understood by Perl is therefore a combination of the currently selected drive and the current working directory on that drive. If we pass a directory to chdir without a drive letter, we remain on the current drive.

There is no direct way in Perl to determine what the current working directory is, since the concept means different things to different platforms. Shells often maintain the current working directory in an environment variable that we can simply check, such as $ENV{PWD} (the name is derived from the Unix pwd command, which stands for "print working directory"). More formally, we can use either the POSIX module or the more specialized Cwd module to find out.

Using the POSIX module, we can find the current working directory by calling the getcwd routine, which maps onto the underlying getcwd or getwd (regional variations may apply) routine provided by the standard C library. It takes no parameters and returns the current working directory as a string:

use POSIX qw(getcwd);
my $cwd = getcwd;

This will work for most, but not all, platforms—a credible getcwd or getwd-like function must be available for the POSIX module to use it. Alternatively, we can use the Cwd module. This is a specialized module dedicated to all issues surrounding the current working directory in as portable a way as possible. It supplies three different ways to determine the current directory:

getcwd and fastcwd are pure Perl implementations that are therefore maximally portable. cwd attempts to use the most natural and safe method to retrieve the current working directory supported by the underlying platform, which might be getcwd or some other operating system interface, depending on whether it be Unix, Windows, VMS, OS/2, and so on.

getcwd is an implementation of the real getcwd as provided by POSIX, written purely in Perl. It works by opening the parent directory with opendir, then scanning each entry in turn with readdir and lstat, looking for a match with the current directory using the first two values returned (the dev and ino fields). From this it deduces the name of the current directory, and so on all the way to the top of the filing system. getcwd avoids using chdir because, having chdired out of the current directory, permissions may not allow it to chdir back in again. Instead, it assembles an increasingly long string of /../../../ to access each parent directory in turn. This makes it safe but slow, although it will work in the absence of any additional cooperation from the operating system.

fastgetcwd is also a pure Perl implementation. It works just like getcwd but assumes chdir is always safe. Instead of accessing each parent directory through an extending string of /.., it uses chdir to jump up to the parent directory and analyze it directly. This makes it a lot faster than getcwd, but it may mean that the current working directory changes if fastgetcwd fails to restore the current working directory due to its permissions.

cwd attempts to use the best safe and "natural" underlying mechanism available for determining the current working directory, essentially executing the native command to return the current working directory—on a Unix platform this is the pwd command, on Windows it is command /c cd, and so on. It does not use the POSIX module. If all else fails, the Perl-only getcwd covered previously is used. This makes it the best solution for most applications, since it takes advantage of OS support if any is available, but it can survive happily (albeit slowly) without. However, it is slower than the POSIX module because it usually executes an external program.

All three methods return the true path to the current directory, resolving and removing any symbolic links (should we be on a platform that supports them) in the pathname. All four functions (including the alias fastgetcwd) are automatically imported when we use the module and are called in the same way, taking no parameters and returning the current working directory:

use Cwd;            # import 'getcwd', 'fastcwd', 'fastgetcwd', and 'cwd'

$cwd = getcwd;      # slow, safe Perl
$cwd = fastcwd;     # faster but potentially unsafe Perl
$cwd = fastgetcwd;  # alias for 'fastcwd'
$cwd = cwd;         # use native platform support

If we only want to use one of these functions, say cwd, we can tell the module to export just that one function with

use Cwd qw(cwd);

Sometimes we want to find the path to a directory other than the one we are currently in. One way to do that is to chdir to the directory in question, determine the current working directory, and then chdir back. Since this is a chore, the Cwd module encapsulates the process in abs_path (alias realpath) and fast_abs_path functions, each of which can be imported into our application by explicitly naming them. Both take a path to a file or directory and return the true absolute path to it, resolving any symbolic links and instances of . or .. as they go:

use Cwd qw(abs_path realpath fast_abs_path);

# find the real path of 'filename'
$absdir = abs_path('symboliclink/filename');

# 'realpath' is an alias for 'abs_path'
$absdir = realpath('symboliclink/filename');

# find the real path of our great-grandparent directory
$absdir = fast_abs_path('../../..');

The cwd function is actually just a wrapper around abs_path with an argument of '.', the current directory. By contrast, fast_abs_path is a wrapper around getcwd that uses chdir to change to the requested directory beforehand and chdir again to restore the current working directory afterward.

In addition to the various cwd functions and the abs_path routines, Cwd supplies one more routine, chdir, which improves on the standard built-in chdir by automatically updating the environment variable $ENV{PWD} in the same manner as some shells do. We can have this chdir override the standard chdir by importing it specifically:

# override system 'chdir'
use Cwd qw(chdir);

After this, chdir will automatically update $ENV{PWD} each time we use it. The original chdir is still available as CORE::chdir, of course.
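
To illustrate (the directories are examples only):

use Cwd qw(chdir);

chdir '/tmp';                  # also updates $ENV{PWD}
print "Now in $ENV{PWD}\n";
CORE::chdir '/';               # built-in version, $ENV{PWD} untouched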

Summary

In this chapter, we covered Perl's interaction with the filing system, including the naming of files and directories, testing for the existence of files, using the built-in stat and lstat functions, and deleting, renaming, copying, moving, comparing, and finding files.

Doing all of this portably can be a challenge, but fortunately Perl helps us out, first by natively understanding Unix-style filenaming conventions on almost any platform, and second by providing the File:: family of modules for portable file system operations. File::Spec and File::Spec::Functions are the underlying foundation for these modules, while modules like File::Basename and File::Copy provide higher-level functionality we can use for portable file system manipulation.

We also looked at Perl's glob operator and the underlying File::Glob:: modules that modern Perls invoke when we use it. We went on to look at the creation and use of temporary files, a special case of filing system interaction that can be very important to get right. Finally, we took a special look at the particular properties and problems of managing directories, which are like files in some ways but quite unlike them in many others.
