Currently, there are more than 200 modules in the standard distribution, covering topics such as string and text processing, networking and web tools, system interfaces, database interfaces, serialization, data structures and algorithms, user interfaces, numerical computing, and others. We touch on only the most widely used ones here and mention some of the more powerful and specialized ones in Chapter 9 and Chapter 10.
The
string
module is
somewhat of a historical anomaly. If Python were being designed
today, chances are many functions currently in the
string
module would be implemented instead as
methods of string objects.[57]
The
string
module operates on strings. Table 8.4 lists the most useful functions defined in the
string
module, along with brief descriptions, just
to give you an idea as to the module’s purpose. The
descriptions given here are not complete; for an exhaustive listing,
check the Library Reference or the
Python Pocket Reference. Except when otherwise
noted, each function returns a string.
Table 8-4. String Module Functions
The string module also defines a few useful constants, as shown in Table 8.5.
Table 8-5. String Module Constants
Constant Name |
Value |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[a] On
most systems, the |
The constants in Table 8.5 are generally used to test whether specific
characters fit a criterion; for example, x in string.whitespace
returns true only if x is one of the whitespace characters.
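As a small illustration of such membership tests, here is a sketch that assumes a current Python 3 interpreter rather than the older Python described in the text. It relies only on string.whitespace and string.digits, which still exist there; the helper name is_all_digits is hypothetical.

```python
import string

def is_all_digits(text):
    """Return True if text is non-empty and every character is in string.digits."""
    return len(text) > 0 and all(ch in string.digits for ch in text)

# Membership tests against the module constants
tab_is_space = "\t" in string.whitespace   # a tab counts as whitespace
x_is_space = "x" in string.whitespace      # an ordinary letter does not
```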
A typical use of the string
module is to clean up
user input. The following line removes all “extra”
whitespace, meaning it replaces sequences of whitespace with single
space characters, and it deletes leading and trailing spaces:
thestring = string.strip(string.join(string.split(thestring)))
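In a current Python 3 interpreter (an assumption; the line above is written for the older string module), the same cleanup can be sketched with string methods. The function name squeeze_whitespace is made up for this example.

```python
def squeeze_whitespace(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # split() with no argument splits on any run of whitespace and ignores
    # leading and trailing whitespace, so no separate strip step is needed.
    return " ".join(text.split())

cleaned = squeeze_whitespace("  hello \t big\n\n world  ")
```

The explicit strip call in the original line is harmless but redundant for the same reason: splitting on whitespace never yields empty leading or trailing pieces.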
The string
module defines basic operations on
strings. It shows up in almost all programs that interact with files
or users. Because Python strings can contain null bytes, they can
also process binary data—more on this when we get to the
struct
module.
In addition, Python provides a specialized string-processing tool to
use with regular expressions. For a long time, Python’s regular
expressions (available in the regex
and
regsub
modules), while adequate for some tasks,
were not up to par with those offered by competing languages, such as
Perl. As of Python 1.5, a new module called re
provides a completely overhauled regular expression package, which
significantly enhances Python’s string-processing abilities.
Regular expressions are strings that let you define complicated
pattern matching and replacement rules for strings. These strings are
made up of symbols that emphasize compact notation over mnemonic
value. For example, the single character .
means
“match any single character.” The character
+
means “one or more of what just preceded
me.” Table 8.6 lists some of the most
commonly used regular expression symbols and their meanings in
English.
Table 8-6. Common Elements of Regular Expression Syntax
Special Character |
Meaning |
---|---|
. |
Matches any character except newline by default |
^ |
Matches the start of the string |
$ |
Matches the end of the string |
* |
“Any number of occurrences of what just preceded me” |
+ |
“One or more occurrences of what just preceded me” |
| |
“Either the thing before me or the thing after me” |
\w |
Matches any alphanumeric character |
\d |
Matches any decimal digit |
|
Matches the string |
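To make these symbols concrete, the following sketch tries several of them with the re module. It assumes a current Python 3 interpreter, and the sample strings are invented for the illustration.

```python
import re

sample = "tomato 42, potato 7"

numbers = re.findall(r"\d+", sample)               # runs of one or more digits
either = re.findall(r"tomato|potato", sample)      # alternation with |
first_word = re.match(r"^\w+", sample).group()     # word characters at the start
last_digit = re.search(r"\d$", sample).group()     # a digit at the very end

# By default "." matches any character except a newline.
dot_matches = re.findall(r"t.mato", "tomato tamato t\nmato")
```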
Suppose you need to write a program to replace the strings “green pepper” and “red pepper” with “bell pepper” if and only if they occur together in a paragraph before the word “salad” and not if they are followed (with no space) by the string “corn.” These kinds of requirements are surprisingly common in computing. Assume that the file you need to process is called pepper.txt. Here’s a silly example of such a file:
This is a paragraph that mentions bell peppers multiple times. For one, here is a red pepper and dried tomato salad recipe. I don't like to use green peppers in my salads as much because they have a harsher flavor.

This second paragraph mentions red peppers and green peppers but not the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns, which aren't vegetables but spices (by the way, bell peppers really aren't peppers, they're chilies, but would you rather have a good cook or a good botanist prepare your salad?).
The first task is to open it and read in the text:
file = open('pepper.txt')
text = file.read()
We read the entire text at once and avoid splitting it into lines,
since we will assume that paragraphs are defined by two consecutive
newline characters. This is easy to do using the
split
function of the string
module:
import string
paragraphs = string.split(text, '\n\n')
At this point we’ve split the text into a list of paragraph strings, and all that is left to do is perform the actual replacement operation. Here’s where regular expressions come in:
import re
matchstr = re.compile(
    r"""\b(red|green)     # 'red' or 'green' starting new words
        (\s+              # followed by whitespace
         pepper           # the word 'pepper'
         (?!corn)         # if not followed immediately by 'corn'
         (?=.*salad))""", # and if followed at some point by 'salad'
      re.IGNORECASE |     # allow pepper, Pepper, PEPPER, etc.
      re.DOTALL |         # allow to match newlines as well
      re.VERBOSE)         # this allows the comments and the newlines above
for paragraph in paragraphs:
    fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
    print fixed_paragraph+'\n'
The re.compile(...) statement is the hardest part; it creates a compiled regular expression pattern, which is like a program. Such a pattern specifies two things: which parts of the strings we’re interested in and how they should be grouped. Let’s go over these in turn.
Defining which parts of the
string we’re interested in is done by specifying a pattern of
characters that defines a match. This is done by concatenating
smaller patterns, each of which specifies a simple matching criterion
(e.g., “match the string 'pepper',” “match one or more whitespace
characters,” “don’t match 'corn',” etc.). As mentioned, we’re looking
for the words red or green, if they’re followed by the word pepper,
which is itself followed by the word salad, as long as pepper
isn’t followed immediately by 'corn'.
Let’s take each line of the re.compile(...)
expression in turn.
The first thing to notice about the string in the re.compile() call
is that it’s a “raw” string (the quotation marks are preceded by an
r). Prepending such an r to a string (single- or triple-quoted) turns
off the interpretation of the backslash characters within the
string.[58] We could have used a regular string instead and written
\\b instead of \b and \\s instead of \s. In this case, it makes little
difference; for complicated regular expressions, raw strings allow
much clearer syntax than escaped backslashes.
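The difference between raw and regular strings can be checked directly. This is a minimal sketch assuming a current Python 3 interpreter; the variable names and the test sentence are made up.

```python
import re

raw = r"\bred\b"         # raw string: backslashes reach the re module untouched
escaped = "\\bred\\b"    # regular string with each backslash doubled
plain = "\bred\b"        # regular string: here \b is the backspace character

same = raw == escaped    # both spell the same seven characters

found_with_raw = re.search(raw, "a red pepper") is not None
found_with_plain = re.search(plain, "a red pepper") is not None
```

The last pattern looks for literal backspace characters around the word, which is almost never what was intended.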
The first line in the pattern is \b(red|green). \b stands for “the
empty string, but only at the beginning or end of a word”; using it
here prevents matches that have red or green as the final part of a
word (as in “tired pepper”). The (red|green) pattern specifies an
alternation: either 'red' or 'green'. Ignore the left parenthesis that
follows for now. \s is a special symbol that means “any whitespace
character,” and + means “one or more occurrences of whatever comes
before me,” so, put together, \s+ means “one or more whitespace
characters.” Then, pepper just means the string 'pepper'. (?!corn)
prevents matches of “patterns that have 'corn' at this point,” so we
prevent the match on 'peppercorn'. Finally, (?=.*salad) says that for
the pattern to match, it must be followed by any number of characters
(that’s what .* means), followed by the word salad. The ?= bit
specifies that while the pattern should determine whether the match
occurs, it shouldn’t be “used up” by the match process; it’s a subtle
point, which we’ll ignore for now. At this point we’ve defined the
pattern corresponding to the substring.
Now, note that there are two parentheses we haven’t explained yet: the
one before \s+ and the last one. What these two do is define a
“group,” which starts after the red or green and goes to the end of
the pattern. We’ll use that group in the next operation, the actual
replacement. First, we need to mention the three flags that are joined
by the bitwise “or” operator (|). These specify kinds of pattern
matches. The first, re.IGNORECASE, says that the text comparisons
should ignore whether the text and the match have similar or different
cases. The second, re.DOTALL, specifies that the . character should
match any character, including the newline character (that’s not the
default behavior). Finally, the third, re.VERBOSE, allows us to insert
extra newlines and # comments in the regular expression, making it
easier to read and understand. We could have written the statement
more compactly as:
matchstr = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)
The actual replacement operation is done with the line:
fixed_paragraph = matchstr.sub(r'bell\2', paragraph)
First, it should be fairly clear that we’re calling the sub method of
the matchstr object. That object is a compiled regular expression
object, meaning that some of the processing of the expression has
already been done (in this case, outside the loop), thus speeding up
the total program execution. We use a raw string again to write the
first argument to the method. The \2 is a reference to group 2 in the
regular expression (the second group of parentheses), which in our
case holds the whitespace and the word pepper; the lookahead
assertions that follow are checked but consume no text, so the word
'salad' is not part of the group. This line therefore means, “Replace
the matched string with the string that is 'bell' followed by whatever
comes after red or green and goes up to the end of the matched string,
in the paragraph string.”
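To see what each group actually captures, the compact pattern can be applied to one short sentence. This sketch assumes a current Python 3 interpreter, and the sentence is invented. Because lookahead assertions consume no characters, group 2 holds only the whitespace and the word pepper.

```python
import re

pattern = re.compile(r"\b(red|green)(\s+pepper(?!corn)(?=.*salad))", re.I | re.S)

sentence = "A red pepper salad."
match = pattern.search(sentence)

whole = match.group(0)    # the full matched text
color = match.group(1)    # first parenthesized group
rest = match.group(2)     # second group, referenced as \2 in the replacement

replaced = pattern.sub(r"bell\2", sentence)
untouched = pattern.sub(r"bell\2", "A red peppercorn salad.")
```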
So, does it work? The pepper.txt file we saw
earlier had three paragraphs: the first satisfied the requirements of
the match twice, the second didn’t because it didn’t
mention the word “salad,” and the third didn’t
because the red and green words are before peppercorn, not
pepper. As it was supposed to, our program (saved
in a file called pepper.py) modifies only the
first paragraph:
/home/David/book$ python pepper.py
This is a paragraph that mentions bell peppers multiple times. For one, here is a bell pepper and dried tomato salad recipe. I don't like to use bell peppers in my salads as much because they have a harsher flavor.

This second paragraph mentions red peppers and green peppers but not the "s" word (s-a-l-a-d), so no bells should show up.

This third paragraph mentions red peppercorns and green peppercorns, which aren't vegetables but spices (by the way, bell peppers really aren't peppers, they're chilies, but would you rather have a good cook or a good botanist prepare your salad?).
This example, while artificial, shows how regular expressions can compactly express complicated matching rules. If this kind of problem occurs often in your line of work, mastering regular expressions is a worthwhile investment of time and effort.
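For readers on a current Python 3 interpreter, the whole example might be sketched as follows. This is an adaptation under stated assumptions, not the program from the text: it uses string methods instead of the string module, works on an in-memory sample instead of pepper.txt, and the names PEPPER_RE and fix_peppers are hypothetical.

```python
import re

PEPPER_RE = re.compile(
    r"""\b(red|green)    # 'red' or 'green' starting a new word
        (\s+             # followed by whitespace
         pepper          # the word 'pepper'
         (?!corn)        # not followed immediately by 'corn'
         (?=.*salad))    # and followed at some point by 'salad'
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

def fix_peppers(text):
    """Apply the replacement separately to each blank-line-separated paragraph."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(PEPPER_RE.sub(r"bell\2", p) for p in paragraphs)

sample = (
    "Here is a red pepper and tomato salad.\n\n"
    "Red peppers and green peppers, nothing else.\n\n"
    "Red peppercorns are fine in a salad."
)
fixed = fix_peppers(sample)
```

Processing each paragraph on its own matters: with re.DOTALL, a "salad" in a later paragraph would otherwise satisfy the lookahead for peppers in an earlier one.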
Thorough coverage of regular expressions is beyond the scope of
this book. Jeffrey Friedl covers them excellently in his book
Mastering Regular Expressions (O’Reilly & Associates). His
description of Python regular expressions (at least in the First
Edition) uses the old-style syntax, which is no longer the
recommended one, so those specifics should mostly be ignored; the
regular expressions currently used in Python are much more similar to
those of Perl. Still, his book is a must-have for anyone doing
serious text processing. For the casual user (such as these authors),
the descriptions in the Library Reference do the
job most of the time. Use the re module, not the regexp, regex, and
regsub modules, which are deprecated.
The operating-system interface defines the mechanism by which programs are expected to manipulate things like files, processes, users, and threads.
The os module provides a generic interface to the operating system’s
most basic set of tools. The specific set of calls it defines depends
on which platform you use. (For example, the permission-related calls
are available only on platforms that support them, such as Unix and
Windows.) Nevertheless, it’s recommended that you always use
the os module, instead of the platform-specific
versions of the module (called posix, nt, and mac). Table 8.7 lists
some of the most often-used functions in the os module. When referring
to files in the context of the os module, one is referring to
filenames, not file objects.
Table 8-7. Most Frequently Used Functions From the os Module
There are many other functions in the os
module;
in fact, any function that’s part of the POSIX standard and
widely available on most Unix platforms is supported by Python on
Unix. The interfaces to these routines follow the
POSIX conventions. You can retrieve and
set UIDs, PIDs, and process groups; control nice levels; create
pipes; manipulate file descriptors; fork processes; wait for child
processes; send signals to processes; use the
execv
variants; etc.
The os
module also defines some important
attributes that aren’t
functions:
The os.name attribute defines the current version of the
platform-specific operating-system interface. Registered values for
os.name are 'posix', 'nt', 'dos', and 'mac'. It’s different from
sys.platform, which we discussed earlier in this chapter.
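The two attributes can be compared side by side. This is a sketch assuming a current Python 3 interpreter, where the set of registered os.name values differs from the list above (for instance, 'dos' and 'mac' are no longer used).

```python
import os
import sys

interface = os.name        # which OS-specific module backs os, e.g. 'posix' or 'nt'
platform = sys.platform    # a finer-grained identifier, e.g. 'linux' or 'win32'

# A typical use: pick a behavior based on the family of interface available.
uses_posix_calls = interface == "posix"
```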
os.error defines a class used when calls in the os module
raise errors. When this exception is raised, the value of the
exception contains two items. The first is the number
corresponding to the error (known as errno), and
the second is a string message explaining it (known as strerror):
>>> os.rmdir('nonexistent_directory')      # how it usually shows up
Traceback (innermost last):
  File "<stdin>", line 1, in ?
os.error: (2, 'No such file or directory')
>>> try:                                    # we can catch the error and take
...     os.rmdir('nonexistent directory')   # it apart
... except os.error, value:
...     print value[0], value[1]
...
2 No such file or directory
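In a current Python 3 interpreter (an assumption; the session above uses the older except syntax), os.error is an alias of the built-in OSError, and the two items are available as attributes. The helper try_rmdir and the directory name are made up for this sketch.

```python
import errno
import os

def try_rmdir(path):
    """Try to remove a directory; return (errno, strerror) on failure, None on success."""
    try:
        os.rmdir(path)
    except OSError as exc:
        return exc.errno, exc.strerror
    return None

# Assumes no directory with this name exists in the current directory.
failure = try_rmdir("nonexistent_directory_for_demo")
```

Comparing the number against the symbolic names in the errno module (such as errno.ENOENT) is more readable than testing for a bare 2.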
The os.environ
dictionary contains key/value pairs
corresponding to the environment variables of the shell from which
Python was started. Because this environment is inherited by the
commands that are invoked using the os.system
call, modifying the os.environ
dictionary modifies
the environment:
>>> print os.environ['SHELL']
/bin/sh
>>> os.environ['STARTDIR'] = 'MyStartDir'
>>> os.system('echo $STARTDIR')     # 'echo %STARTDIR%' on DOS/Win
MyStartDir                          # printed by the shell
0                                   # return code from echo
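The same inheritance can be observed without depending on a particular shell. This sketch assumes Python 3.7 or later and swaps os.system for the subprocess module, launching a child Python process that reports the variable it received.

```python
import os
import subprocess
import sys

os.environ["STARTDIR"] = "MyStartDir"

# The child process inherits the modified environment of this process.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['STARTDIR'])"],
    capture_output=True,
    text=True,
)
inherited = child.stdout.strip()
```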
The os
module also includes a set of strings that
define portable ways to refer to directory-related operations, as
shown in Table 8.8.
Table 8-8. String Attributes of the os Module
Attribute Name |
Meaning and Value |
---|---|
|
A string that denotes the current directory:
|
|
A string that denotes the parent directory:
|
|
The character that separates pathname components:
|
|
An alternate character to |
|
The character that separates path components:
|
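A brief sketch of how such strings are typically combined, assuming a current Python 3 interpreter where they are available as os.curdir, os.pardir, os.sep, and os.pathsep. The directory and file names are invented.

```python
import os

# Build "parent directory, then data/file.txt" without hard-coding a separator.
relative = os.pardir + os.sep + "data" + os.sep + "file.txt"
same_via_join = os.path.join(os.pardir, "data", "file.txt")

# Build and split a PATH-style list using the platform's list separator.
search_path = os.pathsep.join(["first_dir", "second_dir"])
entries = search_path.split(os.pathsep)
```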
These strings are especially useful when combined with the
functionality in the os.path
module, which
provides many functions that manipulate file paths (see Table 8.9). Note that the os.path
module is an attribute of the os
module;
it’s imported automatically when the os
module is loaded, so you don’t need to import it explicitly.
The outputs of the examples in Table 8.9
correspond to code run on a Windows or DOS machine. On another
platform, the appropriate path separators would be used instead.
Table 8-9. Most Frequently Used Functions from the os.path Module
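As an illustration of the kind of path manipulation os.path offers, here is a sketch assuming a current Python 3 interpreter. The path components are made up, and the results use whatever separator the running platform defines.

```python
import os

path = os.path.join("projects", "notes", "todo.txt")

directory, filename = os.path.split(path)      # everything before / after the last separator
stem, extension = os.path.splitext(filename)   # name without / with its extension
base = os.path.basename(path)                  # same as the second item of split
parent = os.path.dirname(path)                 # same as the first item of split

# exists() takes a filename, not a file object, and simply returns a boolean.
missing = os.path.exists(os.path.join(path, "no_such_entry"))
```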
The keen-eyed reader might have noticed that the os module, while it
provides lots of file-related functions, doesn’t include a copy
function. On DOS, copying a file is basically the same thing as
opening one file in read/binary mode, reading all its data, opening a
second file in write/binary mode, and writing the data to the second
file. On Unix and Windows, making that kind of copy fails to copy the
so-called stat bits (permissions, modification times, etc.) associated
with the file. On the Mac, that operation won’t copy the
resource fork, which contains data such as icons and dialog boxes. In
other words, copying is just not so simple. Nevertheless, often you
can get away with a fairly simple function that works on Windows,
DOS, Unix, and Macs as long as you’re manipulating just data
files with no resource forks. That function, called copyfile, lives
in the shutil module. The module includes a few generally useful
functions, shown in Table 8.10.
Table 8-10. Functions of the shutil Module
Function Name |
Behavior |
---|---|
|
Makes a copy of the file |
|
Copies mode information (permissions) from
|
|
Copies all stat information (mode, utime) from
|
|
Copies data and mode information from |
|
Copies data and stat information from |
|
Copies a directory recursively using |
|
Recursively deletes the directory indicated by
|
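A short sketch of a few of these functions in use, assuming a current Python 3 interpreter. It works in a temporary scratch directory so nothing outside it is touched; the file names are invented.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                 # scratch directory for the demo
source = os.path.join(workdir, "original.txt")
with open(source, "w") as handle:
    handle.write("some data\n")

plain_copy = os.path.join(workdir, "plain.txt")
shutil.copyfile(source, plain_copy)          # file contents only

full_copy = os.path.join(workdir, "full.txt")
shutil.copy2(source, full_copy)              # contents plus stat information

with open(plain_copy) as handle:
    copied_text = handle.read()
both_exist = os.path.exists(plain_copy) and os.path.exists(full_copy)

shutil.rmtree(workdir)                       # recursively delete the scratch directory
cleaned_up = not os.path.exists(workdir)
```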
Python programs often process forms from web pages. To make this task
easy, the standard Python distribution includes a module called
cgi. Chapter 10 includes an
example of a Python script that uses CGI.
Uniform resource locators (URLs) are strings such as
http://www.python.org/ that are now
ubiquitous.[59] Two modules, urllib and urlparse,
provide tools for processing URLs.
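As a sketch of the URL-splitting side of this, the example below assumes a current Python 3 interpreter, where the functions of the urlparse module live in urllib.parse. The path, query, and fragment in the sample URL are invented, and no network access takes place.

```python
from urllib.parse import urljoin, urlparse

parts = urlparse("http://www.python.org/doc/index.html?lang=en#top")

scheme = parts.scheme        # protocol
host = parts.netloc          # network location
path = parts.path            # path on the server
query = parts.query          # text after '?'
fragment = parts.fragment    # text after '#'

# Resolve a relative link against the page it appears on.
joined = urljoin("http://www.python.org/doc/index.html", "tutorial.html")
```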
urllib
defines a few functions for writing
programs that must be active users of the Web (robots, agents, etc.).
These are listed in Table 8.11.