Credit: Mark Lutz, author of Programming Python and Python Quick Reference, co-author of Learning Python
Behold the file—one of the first things that any reasonably pragmatic programmer reaches for in a programming language’s toolbox. Because processing external files is a very real, tangible task, the quality of file-processing interfaces is a good way to assess the practicality of a programming tool.
As the recipes in this chapter attest, Python shines in this task.
Files in Python are supported in a variety of layers: from the built-in open function (a synonym for the standard file object type), to specialized tools in standard library modules such as os, to third-party utilities available on the Web. All told, Python’s arsenal of file tools provides several powerful ways to access files in your scripts.
In Python, a file object is an instance of built-in type file. The built-in function open creates and returns a file object. The first argument, a string, specifies the file’s path (i.e., the filename preceded by an optional directory path). The second argument to open, also a string, specifies the mode in which to open the file. For example:
input = open('data', 'r')
output = open('/tmp/spam', 'w')
open accepts a file path in which directories and files are separated by slash characters (/), regardless of the proclivities of the underlying operating system. On systems that don’t use slashes, you can use a backslash character (\) instead, but there’s no real reason to do so. Backslashes are harder to fit nicely in string literals, since you have to double them up or use “raw” strings. If the file path argument does not include the file’s directory name, the file is assumed to reside in the current working directory (which is a disjoint concept from the Python module search path).
For the mode argument, use 'r' to read the file in text mode; this is the default value and is commonly omitted, so that open is called with just one argument. Other common modes are 'rb' to read the file in binary mode, 'w' to create and write to the file in text mode, and 'wb' to create and write to the file in binary mode. A variant of 'r' that is sometimes precious is 'rU', which tells Python to read the file in text mode with “universal newlines”: mode 'rU' can read text files independently of the line-termination convention the files are using, be it the Unix way, the Windows way, or even the (old) Mac way. (Mac OS X today is a Unix for all intents and purposes, but releases of Mac OS 9 and earlier, just a few years ago, were quite different.)
The distinction between text mode and binary mode is important on non-Unix-like platforms because of the line-termination characters used on these systems. When you open a file in binary mode, Python knows that it doesn’t need to worry about line-termination characters; it just moves bytes between the file and in-memory strings without any kind of translation. When you open a file in text mode on a non-Unix-like system, however, Python knows it must translate between the '\n' line-termination characters used in strings and whatever the current platform uses in the file itself. All of your Python code can always rely on '\n' as the line-termination character, as long as you properly indicate text or binary mode when you open the file.
Once you have a file object, you perform all file I/O by calling methods of this object, as we’ll discuss in a moment. When you’re done with the file, you should finish by calling the close method on the object, to close the connection to the file:
input.close()
In short scripts, people often omit this step, as Python automatically closes the file when a file object is reclaimed during garbage collection (which in mainstream Python means the file is closed just about at once, although other important Python implementations, such as Jython and IronPython, have other, more relaxed garbage-collection strategies). Nevertheless, it is good programming practice to close your files as soon as possible, and it is especially a good idea in larger programs, which otherwise may be at more risk of having excessive numbers of uselessly open files lying about. Note that try/finally is particularly well suited to ensuring that a file gets closed, even when a function terminates due to an uncaught exception.
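For instance, here is a minimal sketch of that pattern (the file name and the process function are placeholders, not part of the original text):

f = open('data')
try:
    for line in f:
        process(line)   # hypothetical per-line handler
finally:
    f.close()           # runs even if process raises an exception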
To write to a file, use the write method:
output.write(s)
where s is a string. Think of s as a string of characters if output is open for text-mode writing, and as a string of bytes if output is open for binary-mode writing. Files have other writing-related methods, such as flush, to send any data being buffered, and writelines, to write a sequence of strings in a single call. However, write is by far the most commonly used method.
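For instance, a minimal sketch of these writing methods (the output path is a placeholder):

output = open('/tmp/spam', 'w')
output.write('a single string\n')          # the workhorse method
output.writelines(['one\n', 'two\n'])      # a whole sequence of strings at once
output.flush()                             # push any buffered data out now
output.close()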
Reading from a file is more common than writing to a file, and more issues are involved, so file objects have more reading methods than writing ones. The readline method reads and returns the next line from a text file. Consider the following loop:
while True:
    line = input.readline()
    if not line:
        break
    process(line)
This was once idiomatic Python but it is no longer the best way to read and process all of the lines from a file. Another dated alternative is to use the readlines method, which reads the whole file and returns a list of lines:
for line in input.readlines():
    process(line)
readlines is useful only for files that fit comfortably in physical memory. If the file is truly huge, readlines can fail or at least slow things down quite drastically (virtual memory fills up and the operating system has to start copying parts of physical memory to disk). In today’s Python, just loop on the file object itself to get a line at a time with excellent memory and performance characteristics:
for line in input:
    process(line)
Of course, you don’t always want to read a file line by line. You may instead want to read some or all of the bytes in the file, particularly if you’ve opened the file for binary-mode reading, where lines are unlikely to be an applicable concept. In this case, you can use the read method. When called without arguments, read reads and returns all the remaining bytes from the file. When read is called with an integer argument N, it reads and returns the next N bytes (or all the remaining bytes, if less than N bytes remain). Other methods worth mentioning are seek and tell, which support random access to files. These methods are normally used with binary files made up of fixed-length records.
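As a minimal sketch of such random access (assuming, purely for illustration, a binary file of 48-byte records):

f = open('somebinfile', 'rb')
f.seek(48 * 6)       # jump straight to the start of the seventh record
record = f.read(48)  # read exactly one record
offset = f.tell()    # current position, now 48 * 7
f.close()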
On the surface, Python’s file support is straightforward. However, before you peruse the code in this chapter, I want to underscore two aspects of Python’s file support: code portability and interface flexibility.
Keep in mind that most file interfaces in Python are fully portable across platform boundaries. It would be difficult to overstate the importance of this feature. A Python script that searches all files in a “directory” tree for a bit of text, for example, can be freely moved from platform to platform without source-code changes: just copy the script’s source file to the new target machine. I do it all the time—so much so that I can happily stay out of operating system wars. With Python’s portability, the underlying platform is almost irrelevant.
Also, it has always struck me that Python’s file-processing interfaces are not restricted to real, physical files. In fact, most file tools work with any kind of object that exposes the same interface as a real file object. Thus, a file reader cares only about read methods, and a file writer cares only about write methods. As long as the target object implements the expected protocol, all goes well.
For example, suppose you have written a general file-processing function such as the following, meant to apply a passed-in function to each line of an input file:
def scanner(fileobject, linehandler):
    for line in fileobject:
        linehandler(line)
If you code this function in a module file and drop that file into a “directory” that’s on your Python search path (sys.path), you can use it any time you need to scan a text file line by line, now or in the future. To illustrate, here is a client script that simply prints the first word of each line:
from myutils import scanner
def firstword(line):
    print line.split()[0]
file = open('data')
scanner(file, firstword)
So far, so good; we’ve just coded a small, reusable software component. But notice that there are no type declarations in the scanner function, only an interface constraint—any object that is iterable line by line will do. For instance, suppose you later want to provide canned test input from a string object, instead of using a real, physical file. The standard StringIO module, and the equivalent but faster cStringIO, provide the appropriate wrapping and interface forgery:
from cStringIO import StringIO
from myutils import scanner
def firstword(line):
    print line.split()[0]
string = StringIO('one\ntwo xxx\nthree\n')
scanner(string, firstword)
StringIO objects are plug-and-play compatible with file objects, so scanner takes its three lines of text from an in-memory string object, rather than a true external file. You don’t need to change the scanner to make this work—just pass it the right kind of object. For more generality, you can even use a class to implement the expected interface instead:
class MyStream(object):
    def __iter__(self):
        # grab and return text from wherever
        return iter(['a\n', 'b c d\n'])

from myutils import scanner
def firstword(line):
    print line.split()[0]
object = MyStream()
scanner(object, firstword)
This time, as scanner attempts to read the file, it really calls out to the __iter__ method you’ve coded in your class. In practice, such a method might use other Python standard tools to grab text from a variety of sources: an interactive user, a popup GUI input box, a shelve object, an SQL database, an XML or HTML page, a network socket, and so on. The point is that scanner doesn’t know or care what type of object is implementing the interface it expects, or what that interface actually does.
Object-oriented programmers know this deliberate naiveté as polymorphism. The type of the object being processed determines what an operation, such as the for-loop iteration in scanner, actually does. Everywhere in Python, object interfaces, rather than specific data types, are the unit of coupling. The practical effect is that functions are often applicable to a much broader range of problems than you might expect. This is especially true if you have a background in statically typed languages such as C or C++. It is almost as if we get C++ templates for free in Python. Code has an innate flexibility that is a by-product of Python’s strong but dynamic typing.
Of course, code portability and flexibility run rampant in Python development and are not really confined to file interfaces. Both are features of the language that are simply inherited by file-processing scripts. Other Python benefits, such as its easy scriptability and code readability, are also key assets when it comes time to change file-processing programs. But rather than extolling all of Python’s virtues here, I’ll simply defer to the wonderful recipes in this chapter and this book at large for more details. Enjoy!
Credit: Luther Blissett
You want to read text or data from a file.
Here’s the most convenient way to read all of the file’s contents at once into one long string:
all_the_text = open('thefile.txt').read()     # all text from a text file
all_the_data = open('abinfile', 'rb').read()  # all data from a binary file
However, it is safer to bind the file object to a name, so that you can call close on it as soon as you’re done, to avoid ending up with open files hanging around. For example, for a text file:
file_object = open('thefile.txt')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()
You don’t necessarily have to use the try/finally statement here, but it’s a good idea to use it, because it ensures the file gets closed even when an error occurs during reading.
The simplest, fastest, and most Pythonic way to read a text file’s contents at once as a list of strings, one per line, is:
list_of_all_the_lines = file_object.readlines()
This leaves a '\n' at the end of each line; if you don’t want that, you have alternatives, such as:
list_of_all_the_lines = file_object.read().splitlines()
list_of_all_the_lines = file_object.read().split('\n')
list_of_all_the_lines = [L.rstrip('\n') for L in file_object]
The simplest and fastest way to process a text file one line at a time is simply to loop on the file object with a for statement:
for line in file_object:
    process line
This approach also leaves a '\n' at the end of each line; you may remove it by starting the for loop’s body with:
line = line.rstrip('\n')
or even, when you’re OK with getting rid of trailing whitespace from each line (not just a trailing '\n'), the generally handier:
line = line.rstrip()
Unless the file you’re reading is truly huge, slurping it all into memory in one gulp is often fastest and most convenient for any further processing. The built-in function open creates a Python file object (alternatively, you can equivalently call the built-in type file). You call the read method on that object to get all of the contents (whether text or binary) as a single long string. If the contents are text, you may choose to immediately split that string into a list of lines with the split method or the specialized splitlines method. Since splitting into lines is frequently needed, you may also call readlines directly on the file object for faster, more convenient operation.
You can also loop directly on the file object, or pass it to callables that require an iterable, such as list or max—when thus treated as an iterable, a file object open for reading has the file’s text lines as the iteration items (therefore, this should be done for text files only). This kind of line-by-line iteration is cheap in terms of memory consumption and fairly speedy too.
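For instance, a minimal sketch of passing a file object where an iterable is expected (the file name is a placeholder):

all_lines = list(open('thefile.txt'))   # each iteration item is one text line
number_of_lines = len(all_lines)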
On Unix and Unix-like systems, such as Linux, Mac OS X, and other BSD variants, there is no real distinction between text files and binary data files. On Windows and very old Macintosh systems, however, line terminators in text files are encoded, not with the standard '\n' separator, but with '\r\n' and '\r', respectively. Python translates these line-termination characters into '\n' on your behalf. This means that you need to tell Python when you open a binary file, so that it won’t perform such translation. To do so, use 'rb' as the second argument to open. This is innocuous even on Unix-like platforms, and it’s a good habit to distinguish binary files from text files even there, although it’s not mandatory in that case. Such good habits will make your programs more immediately understandable, as well as more compatible with different platforms.
If you’re unsure about which line-termination convention a certain text file might be using, use 'rU' as the second argument to open, requesting universal endline translation. This lets you freely interchange text files among Windows, Unix (including Mac OS X), and old Macintosh systems, without worries: all kinds of line-ending conventions get mapped to '\n', whatever platform your code is running on.
You can call methods such as read directly on the file object produced by the open function, as shown in the first snippet of the solution. When you do so, you no longer have a reference to the file object as soon as the reading operation finishes. In practice, Python notices the lack of a reference at once, and immediately closes the file. However, it is better to bind a name to the result of open, so that you can call close yourself explicitly when you are done with the file. This ensures that the file stays open for as short a time as possible, even on platforms such as Jython, IronPython, and other hypothetical future versions of Python, on which more advanced garbage-collection mechanisms might delay the automatic closing that the current version of C-based Python performs at once. To ensure that a file object is closed even if errors happen during its processing, the most solid and prudent approach is to use the try/finally statement:
file_object = open('thefile.txt')
try:
    for line in file_object:
        process line
finally:
    file_object.close()
Be careful not to place the call to open inside the try clause of this try/finally statement (a rather common error among beginners). If an error occurs during the opening, there is nothing to close, and besides, nothing gets bound to name file_object, so you definitely don’t want to call file_object.close()!
If you choose to read the file a little at a time, rather than all at once, the idioms are different. Here’s one way to read a binary file 100 bytes at a time, until you reach the end of the file:
file_object = open('abinfile', 'rb')
try:
    while True:
        chunk = file_object.read(100)
        if not chunk:
            break
        do_something_with(chunk)
finally:
    file_object.close()
Passing an argument N to the read method ensures that read will read only the next N bytes (or fewer, if the file is closer to the end). read returns the empty string when it reaches the end of the file. Complicated loops are best encapsulated as reusable generators. In this case, we can encapsulate the logic only partially, because a generator’s yield keyword is not allowed in the try clause of a try/finally statement. Giving up on the assurance of file closing afforded by try/finally, we can therefore settle for:
def read_file_by_chunks(filename, chunksize=100):
    file_object = open(filename, 'rb')
    while True:
        chunk = file_object.read(chunksize)
        if not chunk:
            break
        yield chunk
    file_object.close()
Once this read_file_by_chunks generator is available, your application code to read and process a binary file by fixed-size chunks becomes extremely simple:
for chunk in read_file_by_chunks('abinfile'):
    do_something_with(chunk)
Reading a text file one line at a time is a frequent task. Just loop on the file object, as in:
for line in open('thefile.txt', 'rU'):
    do_something_with(line)
Here, too, in order to be 100% certain that no uselessly open file object will ever be left just hanging around, you may want to code this snippet in a more rigorously correct and prudent way:
file_object = open('thefile.txt', 'rU')
try:
    for line in file_object:
        do_something_with(line)
finally:
    file_object.close()
Recipe 2.2; documentation for the open built-in function and file objects in the Library Reference and Python in a Nutshell.
Credit: Luther Blissett
Here is the most convenient way to write one long string to a file:
open('thefile.txt', 'w').write(all_the_text)  # text to a text file
open('abinfile', 'wb').write(all_the_data)    # data to a binary file
However, it is safer to bind the file object to a name, so that you can call close on the file object as soon as you’re done. For example, for a text file:
file_object = open('thefile.txt', 'w')
file_object.write(all_the_text)
file_object.close()
Often, the data you want to write is not in one big string, but in a list (or other sequence) of strings. In this case, you should use the writelines method (which, despite its name, is not limited to lines and works just as well with binary data as with text files!):
file_object.writelines(list_of_text_strings)
open('abinfile', 'wb').writelines(list_of_data_strings)
Calling writelines is much faster than the alternatives of joining the strings into one big string (e.g., with ''.join) and then calling write, or calling write repeatedly in a loop.
To create a file object for writing, you must always pass a second argument to open (or file)—either 'w' to write textual data or 'wb' to write binary data. The same considerations detailed previously in Recipe 2.1 apply here, except that calling close explicitly is even more advisable when you’re writing to a file rather than reading from it. Only by closing the file can you be reasonably sure that the data is actually on the disk and not still residing in some temporary buffer in memory.
Writing a file a little at a time is even more common than reading a file a little at a time. You can just call write and/or writelines repeatedly, as each string or sequence of strings to write becomes ready. Each write operation appends data at the end of the file, after all the previously written data. When you’re done, call the close method on the file object. If all the data is available at once, a single writelines call is faster and simpler. However, if the data becomes available a little at a time, it’s better to call write as the data comes, than to build up a temporary list of pieces (e.g., with append) just in order to be able to write it all at once in the end with writelines. Reading and writing are quite different, with respect to the performance and convenience implications of operating “in bulk” versus operating a little at a time.
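A minimal sketch of this incremental pattern (produce_data is a hypothetical source of strings, not part of the recipe):

file_object = open('thefile.txt', 'w')
try:
    for piece in produce_data():     # hypothetical: yields strings as they become ready
        file_object.write(piece)     # each write appends after the previous data
finally:
    file_object.close()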
When you open a file for writing with option 'w' (or 'wb'), any data that might already have been in the file is immediately destroyed; even if you close the file object immediately after opening it, you still end up with an empty file on the disk. If you want the data you’re writing to be appended to the previous contents of the file, open the file with option 'a' (or 'ab') instead. More advanced options allow both reading and writing on the same open file object—in particular, see Recipe 2.8 for option 'r+b', which, in practice, is the only frequently used one out of all the advanced option strings.
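For instance, a minimal sketch of appending to an existing file (the file name is a placeholder):

log_file = open('thefile.txt', 'a')   # previous contents are preserved
log_file.write('one more line\n')     # new data lands at the end
log_file.close()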
Recipe 2.1; Recipe 2.8; documentation for the open built-in function and file objects in the Library Reference and Python in a Nutshell.
Credit: Jeff Bauer, Adam Krieg
You need to change one string into another throughout a file.
String substitution is most simply performed by the replace method of string objects. The work here is to support reading from a specified file (or standard input) and writing to a specified file (or standard output):
#!/usr/bin/env python
import os, sys
nargs = len(sys.argv)
if not 3 <= nargs <= 5:
    print "usage: %s search_text replace_text [infile [outfile]]" % os.path.basename(sys.argv[0])
else:
    stext = sys.argv[1]
    rtext = sys.argv[2]
    input_file = sys.stdin
    output_file = sys.stdout
    if nargs > 3:
        input_file = open(sys.argv[3])
    if nargs > 4:
        output_file = open(sys.argv[4], 'w')
    for s in input_file:
        output_file.write(s.replace(stext, rtext))
    output_file.close()
    input_file.close()
This recipe is really simple, but that’s what’s beautiful about it—why do complicated stuff when simple stuff suffices? As indicated by the leading “shebang” line, the recipe is a simple main script, meaning a script meant to be run directly at a shell command prompt, as opposed to a module meant to be imported from elsewhere. The script looks at its arguments to determine the search text, the replacement text, the input file (defaulting to standard input), and the output file (defaulting to standard output). Then, it loops over each line of the input file, writing to the output file a copy of the line with the substitution performed on it. That’s all! For accuracy, the script closes both files at the end.
As long as an input file fits comfortably in memory in two copies (one before and one after the replacement, since strings are immutable), we could, with an increase in speed, operate on the entire input file’s contents at once instead of looping. With today’s low-end PCs typically containing at least 256 MB of memory, handling files of up to about 100 MB should not be a problem, and few text files are bigger than that. It suffices to replace the for loop with one single statement:
output_file.write(input_file.read().replace(stext, rtext))
As you can see, that’s even simpler than the loop used in the recipe.
Documentation for the open built-in function, file objects, and strings’ replace method in the Library Reference and Python in a Nutshell.
Credit: Luther Blissett
The standard Python library linecache module makes this task a snap:
import linecache
theline = linecache.getline(thefilepath, desired_line_number)
The standard linecache module is usually the optimal Python solution for this task. linecache is particularly useful when you have to perform this task repeatedly for several lines in a file, since linecache caches information to avoid uselessly repeating work. When you know that you won’t be needing any more lines from the cache for a while, call the module’s clearcache function to free the memory used for the cache. You can also use checkcache if the file may have changed on disk and you must make sure you are getting the updated version.
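For instance, a minimal sketch of this usage (the path and line numbers are placeholders):

import linecache
line5 = linecache.getline('thefile.txt', 5)   # reads and caches the whole file
linecache.checkcache()                        # re-read any file that changed on disk
line9 = linecache.getline('thefile.txt', 9)   # served from the cache
linecache.clearcache()                        # free the cache's memory when done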
linecache reads and caches all of the text file whose name you pass to it, so, if it’s a very large file and you need only one of its lines, linecache may be doing more work than is strictly necessary. Should this happen to be a bottleneck for your program, you may get an increase in speed by coding an explicit loop, encapsulated within a function, such as:
def getline(thefilepath, desired_line_number):
    if desired_line_number < 1:
        return ''
    for current_line_number, line in enumerate(open(thefilepath, 'rU')):
        if current_line_number == desired_line_number - 1:
            return line
    return ''
The only detail requiring attention is that enumerate counts from 0, so, since we assume the desired_line_number argument counts from 1, we need the -1 in the == comparison.
Documentation for the linecache module in the Library Reference and Python in a Nutshell; Perl Cookbook recipe 8.8.
Credit: Luther Blissett
The simplest approach for reasonably sized files is to read the file as a list of lines, so that the count of lines is the length of the list. If the file’s path is in a string bound to a variable named thefilepath, all the code you need to implement this approach is:
count = len(open(thefilepath, 'rU').readlines())
For a truly huge file, however, this simple approach may be very slow or even fail to work. If you have to worry about humongous files, a loop on the file always works:
count = -1
for count, line in enumerate(open(thefilepath, 'rU')):
    pass
count += 1
A tricky alternative, potentially faster for truly humongous files, for when the line terminator is '\n' (or has '\n' as a substring, as happens on Windows):
count = 0
thefile = open(thefilepath, 'rb')
while True:
    buffer = thefile.read(8192*1024)
    if not buffer:
        break
    count += buffer.count('\n')
thefile.close()
The 'rb' argument to open is necessary if you’re after speed—without that argument, this snippet might be very slow on Windows.
When an external program counts a file’s lines, such as wc -l on Unix-like platforms, you can of course choose to use that (e.g., via os.popen). However, it’s generally simpler, faster, and more portable to do the line-counting in your own program. You can rely on almost all text files having a reasonable size, so that reading the whole file into memory at once is feasible. For all such normal files, the len of the result of readlines gives you the count of lines in the simplest way.
If the file is larger than available memory (say, a few hundred megabytes on a typical PC today), the simplest solution can become unacceptably slow, as the operating system struggles to fit the file’s contents into virtual memory. It may even fail, when swap space is exhausted and virtual memory can’t help any more. On a typical PC, with 256 MB RAM and virtually unlimited disk space, you should still expect serious problems when you try to read into memory files above, say, 1 or 2 GB, depending on your operating system. (Some operating systems are much more fragile than others in handling virtual-memory issues under such overly stressed load conditions.) In this case, looping on the file object, as shown in this recipe’s Solution, is better. The enumerate built-in keeps the line count without your code having to do it explicitly.
Counting line-termination characters while reading the file by bytes in reasonably sized chunks is the key idea in the third approach. It’s probably the least immediately intuitive, and it’s not perfectly cross-platform, but you might hope that it’s fastest (e.g., when compared with recipe 8.2 in the Perl Cookbook).
However, in most cases, performance doesn’t really matter all that much. When it does matter, the time-sink part of your program might not be what your intuition tells you it is, so you should never trust your intuition in this matter—instead, always benchmark and measure. For example, consider a typical Unix syslog file of middling size, a bit over 18 MB of text in 230,000 lines:
[situ@tioni nuc]$ wc nuc
 231581 2312730 18508908 nuc
And consider the following testing-and-benchmark framework script, bench.py:
import time
def timeo(fun, n=10):
    start = time.clock()
    for i in xrange(n):
        fun()
    stend = time.clock()
    thetime = stend - start
    return fun.__name__, thetime

import os
def linecount_w():
    return int(os.popen('wc -l nuc').read().split()[0])
def linecount_1():
    return len(open('nuc').readlines())
def linecount_2():
    count = -1
    for count, line in enumerate(open('nuc')):
        pass
    return count + 1
def linecount_3():
    count = 0
    thefile = open('nuc', 'rb')
    while True:
        buffer = thefile.read(65536)
        if not buffer:
            break
        count += buffer.count('\n')
    return count

for f in linecount_w, linecount_1, linecount_2, linecount_3:
    print f.__name__, f()
for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f" % timeo(f)
First, I print the line-counts obtained by all methods, thus ensuring that no anomaly or error has occurred (counting tasks are notoriously prone to off-by-one errors). Then, I run each alternative 10 times, under the control of the timing function timeo, and look at the results. Here they are, on the old but reliable machine I measured them on:
[situ@tioni nuc]$ python -O bench.py
linecount_w 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02
As you can see, the performance differences hardly matter: your users will never even notice a difference of 10% or so in one auxiliary task. However, the fastest approach (for my particular circumstances, on an old but reliable PC running a popular Linux distribution, and for this specific benchmark) is the humble loop-on-every-line technique, while the slowest one is the fancy, ambitious technique that counts line terminators by chunks. In practice, unless I had to worry about files of many hundreds of megabytes, I’d always use the simplest approach (i.e., the first one presented in this recipe).
Measuring the exact performance of code snippets (rather than blindly using complicated approaches in the hope that they’ll be faster) is very important—so important, indeed, that the Python Standard Library includes a module, timeit, specifically designed for such measurement tasks. I suggest you use timeit, rather than coding your own little benchmarks as I have done here. The benchmark I just showed you is one I’ve had around for years, since well before timeit appeared in the standard Python library, so I think I can be forgiven for not using timeit in this specific case!
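For instance, a minimal sketch of timing one of these functions with timeit (assuming, hypothetically, that the functions live in an importable module named bench):

import timeit
t = timeit.Timer('linecount_1()', 'from bench import linecount_1')
print t.timeit(number=10)   # total seconds for 10 runs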
The Library Reference and Python in a Nutshell sections on file objects, the enumerate built-in, os.popen, and the time and timeit modules; Perl Cookbook recipe 8.2.
Credit: Luther Blissett
This task is best handled by two nested loops, one on lines and another on the words in each line:
for line in open(thefilepath):
    for word in line.split():
        dosomethingwith(word)
The nested for statement’s header implicitly defines words as sequences of nonspaces separated by sequences of spaces (just as the Unix program wc does). For other definitions of words, you can use regular expressions. For example:
import re
re_word = re.compile(r"[\w'-]+")
for line in open(thefilepath):
    for word in re_word.finditer(line):
        dosomethingwith(word.group(0))
In this case, a word is defined as a maximal sequence of alphanumerics, hyphens, and apostrophes.
If you want to use other definitions of words, you will obviously need different regular expressions. The outer loop, on all lines in the file, won’t change.
It’s often a good idea to wrap iterations as iterator objects, and this kind of wrapping is most commonly and conveniently obtained by coding simple generators:
def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()

for word in words_of_file(thefilepath):
    dosomethingwith(word)
This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration. Once you have cleanly encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements. You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can perform that maintenance in just one place—the definition of the iterator—rather than having to hunt for all uses. The advantages are thus very similar to those you obtain in any programming language by appropriately defining and using functions, rather than copying and pasting pieces of code all over the place. With Python’s iterators, you can get these reuse advantages for all of your looping-control structures, too.
We’ve taken the opportunity afforded by the refactoring of the loop into a generator to perform two minor enhancements—ensuring the file is explicitly closed, which is always a good idea, and generalizing the way each line is split into words (defaulting to the split method of string objects, but leaving a door open to more generality). For example, when we need words as defined by a regular expression, we can code another wrapper on top of words_of_file thanks to this “hook”:
import re
def words_by_re(thefilepath, repattern=r"[\w'-]+"):
    wre = re.compile(repattern)
    def line_to_words(line):
        for mo in wre.finditer(line):
            yield mo.group(0)
    return words_of_file(thefilepath, line_to_words)
Here, too, we supply a reasonable default for the regular expression pattern defining a word but still make it easy to pass a different value in those cases in which different definitions are necessary. Excessive generalization is a pernicious temptation, but a little tasteful generalization suggested by experience will most often amply repay the modest effort it requires. Having a function accept an optional argument, while providing the most likely value for the argument as the default value, is among the simplest and handiest ways to implement this modest and often worthwhile kind of generalization.
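For instance, a minimal usage sketch (the path and the alternative pattern are placeholders):

for word in words_by_re('thefile.txt'):
    dosomethingwith(word)
# or, with a different notion of "word": runs of alphanumerics only
for word in words_by_re('thefile.txt', r"\w+"):
    dosomethingwith(word)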
Chapter 19 for more on iterators and generators; Library Reference and Python in a Nutshell on file objects and the re module; Perl Cookbook recipe 8.3.
Credit: Luther Blissett
You want to read a binary record from somewhere inside a large file of fixed-length records, without reading a record at a time to get there.
The byte offset of the start of a record in the file is the size of a record, in bytes, multiplied by the progressive number of the record (counting from 0). So, you can just seek right to the proper spot, then read the data. For example, to read the seventh record from a binary file where each record is 48 bytes long:
thefile = open('somebinfile', 'rb')
record_size = 48
record_number = 6
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)
Note that the record_number of the seventh record is 6: record numbers count from zero!
This approach works only on files (generally binary ones) defined in terms of records that are all the same fixed size in bytes; it doesn’t work on normal text files. For clarity, the recipe shows the file being opened for reading as a binary file by passing 'rb' as the second argument to open, just before the seek. As long as the file object is open for reading as a binary file, you can perform as many seek and read operations as you need, before eventually closing the file again—you don’t necessarily open the file just before performing a seek on it.
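For instance, here is a minimal sketch that visits several records, in arbitrary order, on a single open file object (the record size and record numbers are placeholders):

thefile = open('somebinfile', 'rb')
record_size = 48
for record_number in (6, 0, 3):              # any order works
    thefile.seek(record_size * record_number)
    buffer = thefile.read(record_size)
    do_something_with(buffer)                # hypothetical record handler
thefile.close()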
The section of the Library Reference and Python in a Nutshell on file objects; Perl Cookbook recipe 8.12.
Credit: Luther Blissett
You want to read a binary record from somewhere inside a large file of fixed-length records, change some or all of the values of the record’s fields, and write the record back.
Read the record, unpack it, perform whatever computations you need for the update, pack the fields back into the record, seek to the start of the record again, write it back. Phew. Faster to code than to say:
import struct
format_string = '8l'    # e.g., say a record is 8 4-byte integers
thefile = open('somebinfile', 'r+b')
record_size = struct.calcsize(format_string)
record_number = 6
thefile.seek(record_size * record_number)
buffer = thefile.read(record_size)
fields = list(struct.unpack(format_string, buffer))
# Perform computations, suitably modifying fields, then:
buffer = struct.pack(format_string, *fields)
thefile.seek(record_size * record_number)
thefile.write(buffer)
thefile.close()
This approach works only on files (generally binary ones) defined in terms of records that are all the same, fixed size; it doesn’t work on normal text files. Furthermore, the size of each record must be that defined by a struct format string, as shown in the recipe’s code. A typical format string, for example, might be '8l', to specify that each record is made up of eight four-byte integers, each to be interpreted as a signed value and unpacked into a Python int. In this case, the fields variable in the recipe would be bound to a list of eight ints. Note that struct.unpack returns a tuple. Because tuples are immutable, the computation would have to rebind the entire fields variable. A list is mutable, so each field can be rebound as needed. Thus, for convenience, we explicitly ask for a list when we bind fields. Make sure, however, not to alter the length of the list. In this case, it needs to remain composed of exactly eight integers, or the struct.pack call will raise an exception when we call it with a format_string of '8l'. Also, this recipe is not suitable when working with records that are not all of the same, unchanging length.
To seek back to the start of the record, instead of using the record_size * record_number offset again, you may choose to do a relative seek:
thefile.seek(-record_size, 1)
The second argument to the seek method (1) tells the file object to seek relative to the current position (here, so many bytes back, because we used a negative number as the first argument). seek’s default is to seek to an absolute offset within the file (i.e., from the start of the file). You can also explicitly request this default behavior by calling seek with a second argument of 0.
You don’t need to open the file just before you do the first seek, nor do you need to close it right after the write. Once you have a file object that is correctly opened (i.e., for updating and as a binary rather than a text file), you can perform as many updates on the file as you want before closing the file again. These calls are shown here to emphasize the proper technique for opening a file for random-access updates and the importance of closing a file when you are done with it.
The file needs to be opened for updating (i.e., to allow both reading and writing). That’s what the 'r+b' argument to open means: open for reading and writing, but do not implicitly perform any transformations on the file’s contents because the file is a binary one. (The 'b' part is unnecessary but still recommended for clarity on Unix and Unix-like systems. However, it’s absolutely crucial on other platforms, such as Windows.) If you’re creating the binary file from scratch, but you still want to be able to go back, reread, and update some records without closing and reopening the file, you can use a second argument of 'w+b' instead. However, I have never witnessed this strange combination of requirements; binary files are normally first created (by opening them with 'wb', writing data, and closing the file) and later reopened for updating with 'r+b'.
While this approach is normally useful only on a file whose records are all the same size, another, more advanced possibility exists: a separate “index file” that provides the offset and length of each record inside the “data file”. Such indexed sequential access approaches aren’t much in fashion any more, but they used to be very important. Nowadays, one meets just about only text files (of many kinds, more and more often XML ones), databases, and occasional binary files with fixed-length records. Still, if you do need to access an indexed sequential binary file, the code is quite similar to that shown in this recipe, except that you must obtain the record_size and the offset argument to pass to thefile.seek by reading them from the index file, rather than computing them yourself as shown in this recipe’s Solution.
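As a minimal sketch of that variant, assuming a hypothetical index file that holds one “offset length” pair of integers per text line, with the line’s position giving the record number:

def read_indexed_record(data_path, index_path, record_number):
    # hypothetical index format: one "offset length" pair per line
    index_line = open(index_path).readlines()[record_number]
    offset, length = map(int, index_line.split())
    datafile = open(data_path, 'rb')
    datafile.seek(offset)
    record = datafile.read(length)
    datafile.close()
    return record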
The sections of the Library Reference and Python in a Nutshell on file objects and the struct module; Perl Cookbook recipe 8.13.
Credit: Paul Prescod, Alex Martelli
You want to directly examine some or all of the files contained in an archive in zip format, without expanding them on disk.
zip files are a popular, cross-platform way of archiving files. The Python Standard Library comes with a zipfile module to access such files easily:
import zipfile
z = zipfile.ZipFile("zipfile.zip", "r")
for filename in z.namelist():
    print 'File:', filename,
    bytes = z.read(filename)
    print 'has', len(bytes), 'bytes'
Python can work directly with data in zip files. You can look at the list of items in the archive’s directory and work with the data files themselves. This recipe is a snippet that lists all of the names and content lengths of the files included in the zip archive zipfile.zip.
The zipfile module does not currently handle multidisk zip files nor zip files with appended comments. Take care to use 'r' as the flag argument, not 'rb', which might seem more natural (e.g., on Windows). With ZipFile, the flag is not used the same way when opening a file, and 'rb' is not recognized. The 'r' flag handles the inherently binary nature of all zip files on all platforms.
When a zip file contains some Python modules (meaning .py or preferably .pyc files), possibly in addition to other (data) files, you can add the file’s path to Python’s sys.path and then use the import statement to import modules from the zip file. Here’s a toy, self-contained, purely demonstrative example that creates such a zip file on the fly, imports a module from it, then removes it—all just to show you how it’s done:
import zipfile, tempfile, os, sys
handle, filename = tempfile.mkstemp('.zip')
os.close(handle)
z = zipfile.ZipFile(filename, 'w')
z.writestr('hello.py', 'def f(): return "hello world from "+__file__\n')
z.close()
sys.path.insert(0, filename)
import hello
print hello.f()
os.unlink(filename)
Running this script emits something like:
hello world from /tmp/tmpESVzeY.zip/hello.py
Besides illustrating Python’s ability to import from a zip file, this snippet also shows how to make (and later remove) a temporary file, and how to use the writestr method to add a member to a zip file without placing that member into a disk file first.
Note that the path to the zip file from which you import is treated somewhat like a directory. (In this specific example run, that path is /tmp/tmpESVzeY.zip, but of course, since we’re dealing with a temporary file, the exact value of the path can change at each run, depending also on your platform.) In particular, the __file__ global variable, within the module hello, which is imported from the zip file, has a value of /tmp/tmpESVzeY.zip/hello.py—a pseudo-path, made up of the zip file’s path seen as a “directory” followed by the relative path of hello.py within the zip file. If you import from a zip file a module that computes paths relative to itself in order to get to data files, you need to adapt the module to this effect, because you cannot just open such a “pseudo-path” to get a file object: rather, to read or write files inside a zip file, you must use functions from standard library module zipfile, as shown in the solution.
For more information about importing modules from a zip file, see Recipe 16.12. While that recipe is Unix-specific, the information in the recipe’s Discussion about importing from zip files is also valid for Windows.
Documentation for the zipfile module in the Library Reference and Python in a Nutshell; modules tempfile, os, sys; for archiving a tree of files, see Recipe 2.11; for more information about importing modules from a zip file, Recipe 16.12.
Credit: Indyana Jones
Your program receives a zip file as a string of bytes in memory, and you need to read the information in this zip file.
Solving this kind of problem is exactly what standard library module cStringIO is for:
import cStringIO, zipfile

class ZipString(zipfile.ZipFile):
    def __init__(self, datastring):
        zipfile.ZipFile.__init__(self, cStringIO.StringIO(datastring))
I often find myself faced with this task—for example, zip files coming from BLOB fields in a database or ones received from a network connection. I used to save such binary data to a temporary file, then open the file with the standard library module zipfile. Of course, I had to ensure I deleted the temporary file when I was done. Then I thought of using the standard library module cStringIO for the purpose . . . and never looked back.
Module cStringIO lets you wrap a string of bytes so it can be accessed as a file object. You can also do things the other way around, writing into a cStringIO.StringIO instance as if it were a file object, and eventually recovering its contents as a string of bytes. Most Python modules that take file objects don’t check whether you’re passing an actual file—rather, any file-like object will do; the module’s code just calls on the object whatever file methods it needs. As long as the object supplies those methods and responds correctly when they’re called, everything just works. This demonstrates the awesome power of signature-based polymorphism and hopefully teaches why you should almost never type-test (utter such horrors as if type(x) is y, or even just the lesser horror if isinstance(x, y)) in your own code! A few low-level modules, such as marshal, are unfortunately adamant about using “true” files, but zipfile isn’t, and this recipe shows how simple it makes your life!
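For instance, a minimal usage sketch (reading the bytes from a disk file here only stands in for a BLOB field or a network connection):

datastring = open('zipfile.zip', 'rb').read()
z = ZipString(datastring)
for filename in z.namelist():
    print 'File:', filename, 'has', len(z.read(filename)), 'bytes'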
If you are using a version of Python that is different from the mainstream C-coded one, known as “CPython”, you may not find module cStringIO in the standard library. The leading c in the name of the module indicates that it’s a C-specific module, optimized for speed but not guaranteed to be in the standard library for other compliant Python implementations. Several such alternative implementations include both production-quality ones (such as Jython, which is coded in Java and runs on a JVM) and experimental ones (such as pypy, which is coded in Python and generates machine code, and IronPython, which is coded in C# and runs on Microsoft’s .NET CLR). Not to worry: the Python Standard Library always includes module StringIO, which is coded in pure Python (and thus is usable from any compliant implementation of Python), and implements the same functionality as module cStringIO (albeit not quite as fast, at least on the mainstream CPython implementation). You just need to alter your import statement a bit to make sure you get cStringIO when available and StringIO otherwise. For example, this recipe might become:
import zipfile
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

class ZipString(zipfile.ZipFile):
    def __init__(self, datastring):
        zipfile.ZipFile.__init__(self, StringIO(datastring))
With this modification, the recipe becomes useful in Jython and other alternative implementations.
Modules zipfile and cStringIO in the Library Reference and Python in a Nutshell; Jython is at http://www.jython.org/; pypy is at http://codespeak.net/pypy/; IronPython is at http://ironpython.com/.
Credit: Ed Gordon, Ravi Teja Bhupatiraju
You need to archive all of the files and folders in a subtree into a tar archive file, compressing the data with either the popular gzip approach or the higher-compressing bzip2 approach.
The Python Standard Library’s tarfile module directly supports either kind of compression: you just need to specify the kind of compression you require, as part of the option string that you pass when you call tarfile.TarFile.open to create the archive file. For example:
import tarfile, os

def make_tar(folder_to_backup, dest_folder, compression='bz2'):
    if compression:
        dest_ext = '.' + compression
    else:
        dest_ext = ''
    arcname = os.path.basename(folder_to_backup)
    dest_name = '%s.tar%s' % (arcname, dest_ext)
    dest_path = os.path.join(dest_folder, dest_name)
    if compression:
        dest_cmp = ':' + compression
    else:
        dest_cmp = ''
    out = tarfile.TarFile.open(dest_path, 'w'+dest_cmp)
    out.add(folder_to_backup, arcname)
    out.close()
    return dest_path
You can pass, as argument compression to function make_tar, the string 'gz' to get gzip compression instead of the default bzip2, or you can pass the empty string '' to get no compression at all. Besides making the file extension of the result either .tar, .tar.gz, or .tar.bz2, as appropriate, your choice for the compression argument determines which string is passed as the second argument to tarfile.TarFile.open: 'w', when you want no compression, or 'w:gz' or 'w:bz2' to get the two kinds of compression.
Class tarfile.TarFile offers several other classmethods, besides open, which you could use to generate a suitable instance. I find open handier and more flexible because it takes the compression information as part of the mode string argument. However, if you want to ensure bzip2 compression is used unconditionally, for example, you could choose to call classmethod bz2open instead.
Once we have an instance of class tarfile.TarFile that is set to use the kind of compression we desire, the instance’s method add does all we require. In particular, when string folder_to_backup names a “directory” (or folder), rather than an ordinary file, add recursively adds all of the subtree rooted in that directory. If on some other occasion, we wanted to change this behavior to get precise control on what is archived, we could pass to add an additional named argument recursive=False to switch off this implicit recursion. After calling add, all that’s left for function make_tar to do is to close the TarFile instance and return the path on which the tar file has been written, just in case the caller needs this information.
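For instance, a minimal usage sketch (the paths are placeholders):

archive = make_tar('/home/me/project', '/tmp')         # /tmp/project.tar.bz2
archive = make_tar('/home/me/project', '/tmp', 'gz')   # /tmp/project.tar.gz
archive = make_tar('/home/me/project', '/tmp', '')     # /tmp/project.tar, uncompressed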
Library Reference docs on module tarfile.
Credit: Hamish Lawson
That’s what the setmode function, in the platform-dependent (Windows-only) msvcrt module in the Python Standard Library, is for:
import sys
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
You can now call sys.stdout.write with any bytestring as the argument, and the bytestring will go unmodified to standard output.
While Unix doesn’t make (or need) a distinction between text and binary modes, if you are reading or writing binary data, such as an image, under Windows, the file must be opened in binary mode. This is a problem for programs that write binary data to standard output (as a CGI script, for example, could be expected to do), because Python opens the sys.stdout file object on your behalf, normally in text mode.
You can have stdout opened in binary mode instead by supplying the -u command-line option to the Python interpreter. For example, if you know your CGI script will be running under the Apache web server, as the first line of your script, you can use something like:
#! c:/python23/python.exe -u
assuming you’re running under Python 2.3 with a standard installation. Unfortunately, you may not always be able to control the command line under which your script will be started. The approach taken in this recipe’s “Solution” offers a workable alternative. The setmode function provided by the Windows-specific msvcrt module lets you change the mode of stdout’s underlying file descriptor. By using this function, you can ensure from within your program that sys.stdout gets set to binary mode.
Documentation for the msvcrt module in the Library Reference and Python in a Nutshell.
Credit: Erik Max Francis
You like the C++ approach to I/O, based on ostreams and manipulators (special objects that cause special effects on a stream when inserted in it) and want to use it in your Python programs.
Python lets you overload operators by having your classes define special methods (i.e., methods whose names start and end with two underscores). To use << for output, as you do in C++, you just need to code an output stream class that defines the special method __lshift__:
class IOManipulator(object):
    def __init__(self, function=None):
        self.function = function
    def do(self, output):
        self.function(output)

def do_endl(stream):
    stream.output.write('\n')
    stream.output.flush()
endl = IOManipulator(do_endl)

class OStream(object):
    def __init__(self, output=None):
        if output is None:
            import sys
            output = sys.stdout
        self.output = output
        self.format = '%s'
    def __lshift__(self, thing):
        ''' the special method which Python calls when you use the <<
            operator and the left-hand operand is an OStream '''
        if isinstance(thing, IOManipulator):
            thing.do(self)
        else:
            self.output.write(self.format % thing)
            self.format = '%s'
        return self

def example_main():
    cout = OStream()
    cout << "The average of " << 1 << " and " << 3 << " is " << (1+3)/2 << endl
    # emits The average of 1 and 3 is 2

if __name__ == '__main__':
    example_main()
Wrapping Python file-like objects to emulate C++ ostreams syntax is quite easy. This recipe shows how to code the insertion operator << for this purpose. The recipe also implements an IOManipulator class (as in C++) to call arbitrary functions on a stream upon insertion, and a predefined manipulator endl (guess where that name comes from) to write a newline and flush the stream.
The reason class OStream’s instances hold a format attribute and reset it to the default value '%s' after each self.output.write is so that you can build devious manipulators that temporarily save formatting state on the stream object, such as:
def do_hex(stream):
    stream.format = '%x'
hex = IOManipulator(do_hex)
cout << 23 << ' in hex is ' << hex << 23 << ', and in decimal ' << 23 << endl
# emits 23 in hex is 17, and in decimal 23
Some people detest C++’s cout << something syntax, some love it. In cases such as the example given in the recipe, this syntax ends up simpler and more readable than:
print>>somewhere, "The average of %d and %d is %f\n" % (1, 3, (1+3)/2)
which is the “Python-native” alternative (looking a lot like C in this case). It depends in part on whether you’re more used to C++ or to C. In any case, this recipe gives you a choice! Even if you don’t end up using this particular approach, it’s still interesting to see how simple operator overloading is in Python.
Library Reference and Python in a Nutshell docs on file objects and special methods such as __lshift__; Recipe 4.20 implements a Python version of C’s printf function.
Credit: Andrew Dalke
You need to make an input file object (with data coming from a socket or other input file handle) rewindable back to the beginning so you can read it over.
Wrap the file object into a suitable class:
from cStringIO import StringIO

class RewindableFile(object):
    """ Wrap a file handle to allow seeks back to the beginning. """
    def __init__(self, input_file):
        """ Wraps input_file into a file-like object with rewind. """
        self.file = input_file
        self.buffer_file = StringIO()
        self.at_start = True
        try:
            self.start = input_file.tell()
        except (IOError, AttributeError):
            self.start = 0
        self._use_buffer = True
    def seek(self, offset, whence=0):
        """ Seek to a given byte position.
            Must be: whence == 0 and offset == self.start """
        if whence != 0:
            raise ValueError("whence=%r; expecting 0" % (whence,))
        if offset != self.start:
            raise ValueError("offset=%r; expecting %s" % (offset, self.start))
        self.rewind()
    def rewind(self):
        """ Simplified way to seek back to the beginning. """
        self.buffer_file.seek(0)
        self.at_start = True
    def tell(self):
        """ Return the current position of the file (must be at start). """
        if not self.at_start:
            raise TypeError("RewindableFile can't tell except at start of file")
        return self.start
    def _read(self, size):
        if size < 0:              # read all the way to the end of the file
            y = self.file.read()
            if self._use_buffer:
                self.buffer_file.write(y)
            return self.buffer_file.read() + y
        elif size == 0:           # no need to actually read the empty string
            return ""
        x = self.buffer_file.read(size)
        if len(x) < size:
            y = self.file.read(size - len(x))
            if self._use_buffer:
                self.buffer_file.write(y)
            return x + y
        return x
    def read(self, size=-1):
        """ Read up to 'size' bytes from the file.
            Default is -1, which means to read to end of file. """
        x = self._read(size)
        if self.at_start and x:
            self.at_start = False
        self._check_no_buffer()
        return x
    def readline(self):
        """ Read a line from the file. """
        # Can we get it out of the buffer_file?
        s = self.buffer_file.readline()
        if s[-1:] == "\n":
            return s
        # No, so read a line from the input file
        t = self.file.readline()
        if self._use_buffer:
            self.buffer_file.write(t)
        self._check_no_buffer()
        return s + t
    def readlines(self):
        """ Read all remaining lines from the file. """
        return self.read().splitlines(True)
    def _check_no_buffer(self):
        # If 'nobuffer' has been called and we're finished with the buffer
        # file, get rid of the buffer and redirect everything to the
        # original input file.
        if not self._use_buffer and \
           self.buffer_file.tell() == len(self.buffer_file.getvalue()):
            # for top performance, we rebind all relevant methods in self
            for n in 'seek tell read readline readlines'.split():
                setattr(self, n, getattr(self.file, n, None))
            del self.buffer_file
    def nobuffer(self):
        """ Tell RewindableFile to stop using the buffer once it's exhausted. """
        self._use_buffer = False
Sometimes, data coming from a socket or other input file handle isn’t what it was supposed to be. For example, suppose you are reading from a buggy server, which is supposed to return an XML stream, but sometimes returns an unformatted error message instead. (This scenario often occurs because many servers don’t handle incorrect input very well.)
This recipe's RewindableFile class helps you solve this problem. r = RewindableFile(f) wraps the original input stream f into a "rewindable file" instance r, which essentially mimics f's behavior but also provides a buffer. Read requests to r are forwarded to f, and the data thus read gets appended to a buffer, then returned to the caller. The buffer contains all the data read so far.
r can be told to rewind, meaning to seek back to the start position. The next read request will come from the buffer, until the buffer has been read, at which point r gets the data from the input stream again. The newly read data is also appended to the buffer.
When buffering is no longer needed, call the nobuffer method of r. This tells r that, once it's done reading the buffer's current contents, it can throw the buffer away. After nobuffer is called, the behavior of seek is no longer defined.
For example, suppose you have a server that gives either an error message of the form ERROR: cannot do that, or an XML data stream, starting with '<?xml'...:
import urllib2, RewindableFile

infile = urllib2.urlopen("http://somewhere/")
infile = RewindableFile.RewindableFile(infile)
s = infile.readline()
if s.startswith("ERROR:"):
    raise Exception(s[:-1])
infile.seek(0)
infile.nobuffer()     # Don't buffer the data any more
... process the XML from infile ...
One sometimes-useful Python idiom is not supported by the class in this recipe: you can't reliably stash away the bound methods of a RewindableFile instance. (If you don't know what bound methods are, no problem, of course, since in that case you surely won't want to stash them anywhere!) The reason for this limitation is that, when the buffer is empty, the RewindableFile code reassigns the input file's read, readlines, etc., methods, as instance variables of self. This gives slightly better performance, at the cost of not supporting the infrequently used idiom of saving bound methods. See Recipe 6.11 for another example of a similar technique, where an instance irreversibly changes its own methods.
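To see the limitation concretely, here is a minimal sketch (f stands for any input file object, as in the Solution) of the idiom that is not supported:

r = RewindableFile(f)
saved_read = r.read       # stashing the bound method -- don't do this
r.nobuffer()
data = r.read()           # once the buffer empties, r.read may get rebound to f.read
saved_read()              # the stashed method still refers to the discarded buffer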
The tell method, which gives the current location of a file, can be called on an instance of RewindableFile only right after wrapping, and before any reading, to get the beginning byte location. The RewindableFile implementation of tell tries to get the real position from the wrapped file, and use that as the beginning location. If the wrapped file does not support tell, then the RewindableFile implementation of tell just returns 0.
See http://www.dalkescientific.com/Python/ for the latest version of this recipe's code; Library Reference and Python in a Nutshell docs on file objects and module cStringIO; Recipe 6.11 for another example of an instance effecting an irreversible behavior change on itself by rebinding its methods.
Credit: Michael Kent
You need to pass a file-like object (e.g., the result of a call such as urllib.urlopen) to a function or method that insists on receiving a true file object (e.g., a function such as marshal.load).
To cooperate with such type-checking, we need to write all data from the file-like object into a temporary file on disk. Then, we can use the (true) file object for that temporary disk file. Here’s a function that implements this idea:
import tempfile

CHUNK_SIZE = 16 * 1024

def adapt_file(fileObj):
    if isinstance(fileObj, file):
        return fileObj
    tmpFileObj = tempfile.TemporaryFile()
    while True:
        data = fileObj.read(CHUNK_SIZE)
        if not data:
            break
        tmpFileObj.write(data)
    fileObj.close()
    tmpFileObj.seek(0)
    return tmpFileObj
This recipe demonstrates an unusual Pythonic application of the Adapter Design Pattern (i.e., what to do when you have an X and you need a Y instead). While design patterns are most often thought of in an object-oriented way, and therefore implemented by writing classes, nothing is intrinsically necessary about that. In this case, for example, we don't really need to introduce any new class, since the adapt_file function is obviously sufficient. Therefore, we respect Occam's Razor and do not introduce entities without necessity.
One way or another, you should think in terms of adaptation, in preference to type testing, even when you need to rely on some lower-level utility that insists on precise types. Instead of raising an exception when you get passed an object that’s perfectly adequate save for the technicality of type membership, think of the possibility of adapting what you get passed to what you need. In this way, your code will be more flexible and more suitable for reuse.
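As a concrete illustration, here is a hedged sketch of the very scenario in this recipe's Problem statement (the URL is, of course, hypothetical):

import marshal, urllib
# urllib.urlopen returns a file-like object, not a true file, so we adapt
# it before handing it to marshal.load, which insists on a true file object:
data = marshal.load(adapt_file(urllib.urlopen('http://somewhere/data.marshal')))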
Documentation on built-in file objects, and modules tempfile and marshal, in the Library Reference and Python in a Nutshell.
Credit: Robin Parmar, Alex Martelli
You need to examine a directory, or an entire directory tree rooted in a certain directory, and iterate on the files (and optionally folders) that match certain patterns.
The generator os.walk from the Python Standard Library module os is sufficient for this task, but we can dress it up a bit by coding our own function to wrap os.walk:
import os, fnmatch

def all_files(root, patterns='*', single_level=False, yield_folders=False):
    # Expand patterns from semicolon-separated string to list
    patterns = patterns.split(';')
    for path, subdirs, files in os.walk(root):
        if yield_folders:
            files.extend(subdirs)
        files.sort()
        for name in files:
            for pattern in patterns:
                if fnmatch.fnmatch(name, pattern):
                    yield os.path.join(path, name)
                    break
        if single_level:
            break
The standard directory tree traversal generator os.walk is powerful, simple, and flexible. However, as it stands, os.walk lacks a few niceties that applications may need, such as selecting files according to some patterns, flat (linear) looping on all files (and optionally folders) in sorted order, and the ability to examine a single directory (without entering its subdirectories). This recipe shows how easily these kinds of features can be added, by wrapping os.walk into another simple generator and using standard library module fnmatch to check filenames for matches to patterns.
The file patterns are possibly case-insensitive (that's platform-dependent) but otherwise Unix-style, as supplied by the standard fnmatch module, which this recipe uses. To specify multiple patterns, join them with a semicolon. Note that this means that semicolons themselves can't be part of a pattern.
For example, you can easily get a list of all Python and HTML files in directory /tmp or any subdirectory thereof:
thefiles = list(all_files('/tmp', '*.py;*.htm;*.html'))
Should you just want to process these files' paths one at a time (e.g., print them, one per line), you do not need to build a list: you can simply loop on the result of calling all_files:
for path in all_files('/tmp', '*.py;*.htm;*.html'):
    print path
If your platform is case-sensitive, and you want case-sensitive matching, then you need to specify the patterns more laboriously, e.g., '*.[Hh][Tt][Mm][Ll]' instead of just '*.html'.
Documentation for the os.path module and the os.walk generator, as well as the fnmatch module, in the Library Reference and Python in a Nutshell.
Credit: Julius Welby
You need to rename files throughout a subtree of directories, specifically changing the names of all files with a given extension so that they have a different extension instead.
Operating on all files of a whole subtree of directories is easy enough with the os.walk function from Python's standard library:
import os

def swapextensions(dir, before, after):
    if before[:1] != '.':
        before = '.' + before
    thelen = -len(before)
    if after[:1] != '.':
        after = '.' + after
    for path, subdirs, files in os.walk(dir):
        for oldfile in files:
            if oldfile[thelen:] == before:
                oldfile = os.path.join(path, oldfile)
                newfile = oldfile[:thelen] + after
                os.rename(oldfile, newfile)

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 4:
        print "Usage: swapext rootdir before after"
        sys.exit(100)
    swapextensions(sys.argv[1], sys.argv[2], sys.argv[3])
This recipe shows how to change the file extensions of all files in a specified directory, all of its subdirectories, all of their subdirectories, and so on. This technique is useful for changing the extensions of a whole batch of files in a folder structure, such as a web site. You can also use it to correct errors made when saving a batch of files programmatically.
The recipe is usable either as a module to be imported from any other, or as a script to run from the command line, and it is carefully coded to be platform-independent. You can pass in the extensions either with or without the leading dot (.), since the code in this recipe inserts that dot, if necessary. (As a consequence of this convenience, however, this recipe is unable to deal with files completely lacking any extension, including the dot; this limitation may be bothersome on Unix systems.)
The implementation of this recipe uses techniques that purists might consider too low level—specifically by dealing mostly with filenames and extensions by direct string manipulation, rather than by the functions in module os.path. It's not a big deal: using os.path is fine, but using Python's powerful string facilities to deal with filenames is fine, too.
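For instance, a typical call (the directory name here is hypothetical):

# rename every .htm file under /home/me/website, and under all of its
# subdirectories, to .html (the leading dots are optional, as noted above)
swapextensions('/home/me/website', '.htm', '.html')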
The author’s web page at http://www.outwardlynormal.com/python/swapextensions.htm.
Credit: Chui Tey
Given a search path (a string of directories with a separator in between), you need to find the first file along the path with the requested name.
Basically, you need to loop over the directories in the given search path:
import os

def search_file(filename, search_path, pathsep=os.pathsep):
    """ Given a search path, find file with requested name """
    for path in search_path.split(pathsep):
        candidate = os.path.join(path, filename)
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None

if __name__ == '__main__':
    search_path = '/bin' + os.pathsep + '/usr/bin'  # ; on Windows, : on Unix
    find_file = search_file('ls', search_path)
    if find_file:
        print "File 'ls' found at %s" % find_file
    else:
        print "File 'ls' not found"
This recipe’s “Problem” is a reasonably frequent task, and Python makes resolving it extremely easy. Other recipes perform similar and related tasks: to find files specifically on Python’s own search path, see Recipe 2.20; to find all files matching a pattern along a search path, see Recipe 2.19.
The search loop can be coded in many ways, but returning the path (made into an absolute path, for uniformity and convenience) as soon as a hit is found is simplest as well as fast. The explicit return None after the loop is not strictly needed, since None is returned by Python when a function falls off the end. Having the return statement explicitly there in this case makes the functionality of search_file much clearer at first sight.
Recipe 2.20; Recipe 2.19; documentation for the module os in the Library Reference and Python in a Nutshell.
Credit: Bill McNeill, Andrew Kirkpatrick
Given a search path (i.e., a string of directories with a separator in between), you need to find all files along the path whose names match a given pattern.
Basically, you need to loop over the directories in the given search path. The loop is best encapsulated in a generator:
import glob, os

def all_files(pattern, search_path, pathsep=os.pathsep):
    """ Given a search path, yield all files matching the pattern. """
    for path in search_path.split(pathsep):
        for match in glob.glob(os.path.join(path, pattern)):
            yield match
One nice thing about generators is that you can easily use them to obtain just the first item, all items, or anything in between. For example, to print the first file matching '*.pye' along your environment's PATH:

print all_files('*.pye', os.environ['PATH']).next()
To print all such files, one per line:
for match in all_files('*.pye', os.environ['PATH']):
    print match
To print them all at once, as a list:
print list(all_files('*.pye', os.environ['PATH']))
I have also wrapped around this all_files function a main script to show all of the files with a given name along my PATH. Thus I can see not only which one will execute for that name (the first one), but also which ones are "shadowed" by that first one:
if __name__ == '__main__':
    import sys
    if len(sys.argv) != 2 or sys.argv[1].startswith('-'):
        print 'Use: %s <pattern>' % sys.argv[0]
        sys.exit(1)
    matches = list(all_files(sys.argv[1], os.environ['PATH']))
    print '%d matches:' % len(matches)
    for match in matches:
        print match
Recipe 2.18 for a simpler approach to find the first file with a specified name along the path; Library Reference and Python in a Nutshell docs for modules os and glob.
Credit: Mitch Chapman
A large Python application includes resource files (e.g., Glade project files, SQL templates, and images) as well as Python packages. You want to store these associated files together with the Python packages that use them.
You need to be able to look for either files or directories along Python's sys.path:
import sys, os

class Error(Exception):
    pass

def _find(pathname, matchFunc=os.path.isfile):
    for dirname in sys.path:
        candidate = os.path.join(dirname, pathname)
        if matchFunc(candidate):
            return candidate
    raise Error("Can't find file %s" % pathname)

def findFile(pathname):
    return _find(pathname)

def findDir(path):
    return _find(path, matchFunc=os.path.isdir)
Larger Python applications consist of sets of Python packages and associated sets of resource files. It’s convenient to store these associated files together with the Python packages that use them, and it’s easy to do so if you use this variation on the previous Recipe 2.18 to find files or directories with pathnames relative to the Python search path.
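As a sketch of typical use (the package and resource names here are hypothetical), assuming directory myapp lives somewhere on sys.path:

import os
# locate a Glade file and an images directory shipped with package 'myapp'
glade_file = findFile(os.path.join('myapp', 'ui', 'main.glade'))
images_dir = findDir(os.path.join('myapp', 'images'))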
Recipe 2.18; documentation for the os module in the Library Reference and Python in a Nutshell.
Credit: Robin Parmar
Modules must be on the Python search path before they can be imported, but you don’t want to set a huge permanent path because that slows performance—so, you want to change the path dynamically.
We simply conditionally add a directory to Python's sys.path, carefully checking to avoid duplication:
def AddSysPath(new_path):
    """ AddSysPath(new_path): adds a directory to Python's sys.path

    Does not add the directory if it does not exist or if it's already on
    sys.path. Returns 1 if OK, -1 if new_path does not exist, 0 if it was
    already on sys.path.
    """
    import sys, os
    # Avoid adding nonexistent paths
    if not os.path.exists(new_path):
        return -1
    # Standardize the path.  Windows is case-insensitive, so lowercase
    # for definiteness if we are on Windows.
    new_path = os.path.abspath(new_path)
    if sys.platform == 'win32':
        new_path = new_path.lower()
    # Check against all currently available paths
    for x in sys.path:
        x = os.path.abspath(x)
        if sys.platform == 'win32':
            x = x.lower()
        if new_path in (x, x + os.sep):
            return 0
    sys.path.append(new_path)
    # if you want the new_path to take precedence over existing
    # directories already in sys.path, instead of appending, use:
    # sys.path.insert(0, new_path)
    return 1

if __name__ == '__main__':
    # Test and show usage
    import sys
    print 'Before:'
    for x in sys.path:
        print x
    if sys.platform == 'win32':
        print AddSysPath('c:\\Temp')
        print AddSysPath('c:\\temp')
    else:
        print AddSysPath('/usr/lib/my_modules')
    print 'After:'
    for x in sys.path:
        print x
Modules must be in directories that are on the Python search path before they can be imported, but we don't want to have a huge permanent path because doing so slows down every import performed by every Python script and application. This simple recipe dynamically adds a directory to the path, but only if that directory exists and was not already on sys.path.
sys.path is a list, so it's easy to add directories to its end, using sys.path.append. Every import performed after such an append will automatically look in the newly added directory if it cannot be satisfied from earlier ones. As indicated in the Solution, you can alternatively use sys.path.insert(0, ...) so that the newly added directory is searched before ones that were already in sys.path.
It's no big deal if sys.path ends up with some duplicates or if a nonexistent directory is accidentally appended to it; Python's import statement is clever enough to shield itself against such issues. However, each time such a problem occurs at import time (e.g., from duplicate unsuccessful searches, errors from the operating system that need to be handled gracefully, etc.), a small price is paid in terms of performance. To avoid uselessly paying such a price, this recipe does a conditional addition to sys.path, never appending any directory that doesn't exist or is already in sys.path. Directories appended by this recipe stay in sys.path only for the duration of this program's run, just like any other dynamic alteration you might do to sys.path.
Documentation for the sys and os.path modules in the Library Reference and Python in a Nutshell.
Credit: Cimarron Taylor, Alan Ezust
You need to know the relative path from one directory to another—for example, to create a symbolic link or a relative reference in a URL.
The simplest approach is to split paths into lists of directories, then work on the lists. Using a couple of auxiliary and somewhat generic helper functions, we could code:
import os, itertools

def all_equal(elements):
    ''' return True if all the elements are equal, otherwise False. '''
    first_element = elements[0]
    for other_element in elements[1:]:
        if other_element != first_element:
            return False
    return True

def common_prefix(*sequences):
    ''' return a list of common elements at the start of all sequences,
        then a list of lists that are the unique tails of each sequence. '''
    # if there are no sequences at all, we're done
    if not sequences:
        return [], []
    # loop in parallel on the sequences
    common = []
    for elements in itertools.izip(*sequences):
        # unless all elements are equal, bail out of the loop
        if not all_equal(elements):
            break
        # got one more common element, append it and keep looping
        common.append(elements[0])
    # return the common prefix and unique tails
    return common, [sequence[len(common):] for sequence in sequences]

def relpath(p1, p2, sep=os.path.sep, pardir=os.path.pardir):
    ''' return a relative path from p1 equivalent to path p2.
        In particular: the empty string, if p1 == p2;
        p2, if p1 and p2 have no common prefix. '''
    common, (u1, u2) = common_prefix(p1.split(sep), p2.split(sep))
    if not common:
        return p2      # leave path absolute if nothing at all in common
    return sep.join([pardir]*len(u1) + u2)

def test(p1, p2, sep=os.path.sep):
    ''' call function relpath and display arguments and results. '''
    print "from", p1, "to", p2, " -> ", relpath(p1, p2, sep)

if __name__ == '__main__':
    test('/a/b/c/d', '/a/b/c1/d1', '/')
    test('/a/b/c/d', '/a/b/c/d', '/')
    test('c:/x/y/z', 'd:/x/y/z', '/')
The workhorse in this recipe is the simple but very general function common_prefix, which, given any N sequences, returns their common prefix and a list of their respective unique tails. To compute the relative path between two given paths, we can ignore their common prefix. We need only the appropriate number of move-up markers (normally, os.path.pardir—e.g., ../ on Unix-like systems; we need as many of them as the length of the unique tail of the starting path) followed by the unique tail of the destination path. So, function relpath splits the paths into lists of directories, calls common_prefix, and then performs exactly the construction just described.
common_prefix centers on the loop for elements in itertools.izip(*sequences), relying on the fact that izip ends with the shortest of the iterables it's zipping. The body of the loop only needs to prematurely terminate the loop as soon as it meets a tuple of elements (coming one from each sequence, per izip's specifications) that aren't all equal, and to keep track of the elements that are equal by appending one of them to list common. Once the loop is done, all that's left to prepare the lists to return is to slice off the elements that are already in common from the front of each of the sequences.
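For instance, here is what common_prefix computes for the first test case in the Solution:

common, (u1, u2) = common_prefix('/a/b/c/d'.split('/'), '/a/b/c1/d1'.split('/'))
# common == ['', 'a', 'b'];  u1 == ['c', 'd'];  u2 == ['c1', 'd1']
# so relpath('/a/b/c/d', '/a/b/c1/d1', '/') returns '../../c1/d1'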
Function all_equal could alternatively be implemented in a completely different way, less simple and obvious, but interesting:
def all_equal(elements):
    return len(dict.fromkeys(elements)) == 1
or, equivalently and more concisely, in Python 2.4 only,
def all_equal(elements):
    return len(set(elements)) == 1
Saying that all elements are equal is exactly the same as saying that the set of the elements has cardinality (length) one. In the variation using dict.fromkeys, we use a dict to represent the set, so that variation works in Python 2.3 as well as in 2.4. The variation using set is clearer, but it only works in Python 2.4. (You could also make it work in version 2.3, as well as Python 2.4, by using the standard Python library module sets.)
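For completeness, a sketch of that last variation, usable in Python 2.3 as well as 2.4:

import sets
def all_equal(elements):
    # a sets.Set of the elements has length one exactly when all are equal
    return len(sets.Set(elements)) == 1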
Library Reference and Python in a Nutshell docs for modules os and itertools.
Credit: Danny Yoo
Your application needs to read single characters, unbuffered, from standard input, and it needs to work on both Windows and Unix-like systems.
When we need a cross-platform solution and have only platform-dependent ones to start from, we must wrap the different solutions so that they look the same:
try:
    from msvcrt import getch
except ImportError:
    ''' we're not on Windows, so we try the Unix-like approach '''
    def getch():
        import sys, tty, termios
        fd = sys.stdin.fileno()
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setraw(fd)
            ch = sys.stdin.read(1)
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
        return ch
On Windows, the standard Python library module msvcrt offers the handy getch function to read one character, unbuffered, from the keyboard, without echoing it to the screen. However, this module is not part of the standard Python library on Unix and Unix-like platforms, such as Linux and Mac OS X. On such platforms, we can get the same functionality with the tty and termios modules of the standard Python library (which, in turn, are not present on Windows).
The key point is that in application-level code, we should never have to worry about such issues; rather, we should write our application code in platform-independent ways, counting on library functions to paper over the differences between platforms. The Python Standard Library fulfills that role admirably for most tasks, but not all, and the problem posed by this recipe is an example of one for which the Python Standard Library doesn’t directly supply a cross-platform solution.
When we can't find a ready-packaged cross-platform solution in the standard library, we should package it anyway as part of our own additional custom library. This recipe's Solution, besides solving the specific task of the recipe, also shows one good general way to go about such packaging. (Alternatively, you can test sys.platform, as sketched below, but I prefer the approach shown in this recipe.)
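For concreteness, here is a minimal sketch of that alternative, explicitly testing sys.platform instead of catching ImportError:

import sys
if sys.platform == 'win32':
    from msvcrt import getch
else:
    def getch():
        # the Unix-like implementation, exactly as in the Solution
        import tty, termios
        fd = sys.stdin.fileno()
        old_settings = termios.tcgetattr(fd)
        try:
            tty.setraw(fd)
            ch = sys.stdin.read(1)
        finally:
            termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
        return ch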
Your own library module should try to import the standard library module it needs on a certain platform within a try clause and include a corresponding except ImportError clause that is triggered when the module is running on a different platform. In the body of that except clause, your own library module can apply whatever alternate approach will work on the different platform. In some rare cases, you may need more than two platform-dependent approaches, but most often you'll need one approach on Windows and only one other approach to cover all other platforms. This is because most non-Windows platforms today are generally Unix or Unix-like.
Library Reference and Python in a Nutshell docs for msvcrt, tty, and termios.
Credit: Dinu Gherman, Dan Wolfe
You’re running on a reasonably recent version of Mac OS X (version 10.3 “Panther” or later), and you need to know the number of pages in a PDF document.
The PDF format and Python are both natively integrated with Mac OS X (10.3 or later), and this allows a rather simple solution:
#!/usr/bin/python
import CoreGraphics

def pageCount(pdfPath):
    "Return the number of pages for the PDF document at the given path."
    pdf = CoreGraphics.CGPDFDocumentCreateWithProvider(
              CoreGraphics.CGDataProviderCreateWithFilename(pdfPath))
    return pdf.getNumberOfPages()

if __name__ == '__main__':
    import sys
    for path in sys.argv[1:]:
        print pageCount(path)
A reasonable alternative to this recipe might be to use the PyObjC Python extension, which (among other wonders) lets Python code reuse all the power in the Foundation and AppKit frameworks that come with Mac OS X. Such a choice would let you write a Python script that is also able to run on older versions of Mac OS X, such as 10.2 Jaguar. However, relying on Mac OS X 10.3 or later ensures we can use the Python installation that is integrated as a part of the operating system, as well as such goodies as the CoreGraphics Python extension module (also part of Mac OS X "Panther") that lets your Python code reuse Apple's excellent Quartz graphics engine directly.
PyObjC is at http://pyobjc.sourceforge.net/; information on the CoreGraphics module is at http://www.macdevcenter.com/pub/a/mac/2004/03/19/core_graphics.html.
You need to set the attributes of a file on Windows; for example, you may need to set the file as read-only, archived, and so on.
PyWin32's win32api module offers a function SetFileAttributes that makes this task quite simple:
import win32con, win32api, os

# create a file, just to show how to manipulate it
thefile = 'test'
f = open('test', 'w')
f.close()
# to make the file hidden...:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_HIDDEN)
# to make the file readonly:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_READONLY)
# to be able to delete the file we need to set it back to normal:
win32api.SetFileAttributes(thefile, win32con.FILE_ATTRIBUTE_NORMAL)
# and finally we remove the file we just made
os.remove(thefile)
One interesting use of win32api.SetFileAttributes is to enable a file's removal. Removing a file with os.remove can fail on Windows if the file's attributes are not normal. To get around this problem, you just need to use the Win32 call to SetFileAttributes to convert it to a normal file, as shown at the end of this recipe's Solution. Of course, this should be done with caution, since there may be a good reason the file is not "normal". The file should be removed only if you know what you're doing!
The documentation on the win32file module at http://ASPN.ActiveState.com/ASPN/Python/Reference/Products/ActivePython/PythonWin32Extensions/win32file.html.
You need to extract the text content (with or without the attending XML markup) from an OpenOffice.org document.
An OpenOffice.org document is just a zip file that aggregates XML documents according to a well-documented standard. To access our precious data, we don’t even need to have OpenOffice.org installed:
import zipfile, re

rx_stripxml = re.compile("<[^>]*?>", re.DOTALL|re.MULTILINE)

def convert_OO(filename, want_text=True):
    """ Convert an OpenOffice.org document to XML or text. """
    zf = zipfile.ZipFile(filename, "r")
    data = zf.read("content.xml")
    zf.close()
    if want_text:
        data = " ".join(rx_stripxml.sub(" ", data).split())
    return data

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        for docname in sys.argv[1:]:
            print 'Text of', docname, ':'
            print convert_OO(docname)
            print 'XML of', docname, ':'
            print convert_OO(docname, want_text=False)
    else:
        print 'Call with paths to OO.o doc files to see Text and XML forms.'
OpenOffice.org documents are zip files, and in addition to other contents, they always contain the file content.xml. This recipe’s job, therefore, essentially boils down to just extracting this file. By default, the recipe then throws away XML tags with a simple regular expression, splits the result by whitespace, and joins it up again with a single blank to save space. Of course, we could use an XML parser to get information in a vastly richer and more structured way, but if all we need is the rough textual content, this fast, rough-and-ready approach may suffice.
Specifically, the regular expression rx_stripxml matches any XML tag (opening or closing) from the leading < to the terminating >. Inside function convert_OO, in the statements guarded by if want_text, we use that regular expression to change every XML tag into a space, then normalize whitespace by splitting (i.e., calling the string method split, which splits on any sequence of whitespace) and rejoining (with " ".join, to use a single blank character as the joiner). Essentially, this split-and-rejoin process changes any sequence of whitespace into a single blank character. More advanced ways to extract all text from an XML document are shown in Recipe 12.3.
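A tiny demonstration of the tag-stripping and whitespace normalization in isolation (the XML snippet is made up):

s = rx_stripxml.sub(" ", '<text:p text:style-name="P1">Hello,\n  world!</text:p>')
print " ".join(s.split())        # emits: Hello, world!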
Library Reference docs on modules zipfile and re; OpenOffice.org's web site, http://www.openoffice.org/; Recipe 12.3.
Credit: Simon Brunning, Pavel Kosina
You want to extract the text content from each Microsoft Word document in a directory tree on Windows into a corresponding text file.
With the PyWin32 extension, we can access Word itself, through COM, to perform the conversion:
import fnmatch, os, sys, win32com.client

wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.doc'):
                continue
            doc = os.path.abspath(os.path.join(path, filename))
            print "processing %s" % doc
            wordapp.Documents.Open(doc)
            docastxt = doc[:-3] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastxt,
                FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    # ensure Word is properly shut down even if we get an exception
    wordapp.Quit()
A useful aspect of most Windows applications is that you can script them via COM, and the PyWin32 extension makes it fairly easy to perform COM scripting from Python. The extension enables you to write Python scripts to perform many kinds of Windows tasks. The script in this recipe's Solution drives Microsoft Word to extract the text from every .doc file in a directory tree into a corresponding .txt text file. Using the os.walk function, we can access every subdirectory in a tree with a simple for statement, without recursion. With the fnmatch.fnmatch function, we can check a filename to determine whether it matches an appropriate wildcard, here '*.doc'. Once we have determined the name of a Word document file, we process that name with functions from os.path to turn it into a complete absolute path, and have Word open it, save it as text, and close it again.
If you don’t have Word, you may need to take a completely different approach. One possibility is to use OpenOffice.org, which is able to load Word documents. Another is to use a program specifically designed to read Word documents, such as Antiword, found at http://www.winfield.demon.nl/. However, we have not explored these alternative options.
Mark Hammond, Andy Robinson, Python Programming on Win32 (O'Reilly), for documentation on PyWin32; http://msdn.microsoft.com, for Microsoft's documentation of the object model of Microsoft Word; Library Reference and Python in a Nutshell sections on modules fnmatch and os.path, and function os.walk.
Credit: Jonathan Feinberg, John Nielsen
You need to lock files in a program that runs on both Windows and Unix-like systems, but the Python Standard Library offers only platform-specific ways to lock files.
When the Python Standard Library itself doesn’t offer a cross-platform solution, it’s often possible to implement one ourselves:
import os
# needs win32all to work on Windows (NT, 2K, XP, _not_ /95 or /98)
if os.name == 'nt':
    import win32con, win32file, pywintypes
    LOCK_EX = win32con.LOCKFILE_EXCLUSIVE_LOCK
    LOCK_SH = 0    # the default
    LOCK_NB = win32con.LOCKFILE_FAIL_IMMEDIATELY
    __overlapped = pywintypes.OVERLAPPED()
    def lock(file, flags):
        hfile = win32file._get_osfhandle(file.fileno())
        win32file.LockFileEx(hfile, flags, 0, 0xffff0000, __overlapped)
    def unlock(file):
        hfile = win32file._get_osfhandle(file.fileno())
        win32file.UnlockFileEx(hfile, 0, 0xffff0000, __overlapped)
elif os.name == 'posix':
    import fcntl
    from fcntl import LOCK_EX, LOCK_SH, LOCK_NB
    def lock(file, flags):
        fcntl.flock(file.fileno(), flags)
    def unlock(file):
        fcntl.flock(file.fileno(), fcntl.LOCK_UN)
else:
    raise RuntimeError("PortaLocker only defined for nt and posix platforms")
When multiple programs or threads have to access a shared file, it’s wise to ensure that accesses are synchronized so that two processes don’t try to modify the file contents at the same time. Failure to synchronize accesses could even corrupt the entire file in some cases.
This recipe supplies two functions, lock and unlock, that request and release locks on a file, respectively. Using the portalocker.py module is a simple matter of calling the lock function and passing in the file and an argument specifying the kind of lock that is desired:
LOCK_SH
This lock denies all processes, including the process that first locks the file, write access to the file. All processes can read the locked file.
LOCK_EX
This lock denies all other processes both read and write access to the file.
LOCK_NB
When this value is specified, the function returns immediately if it is unable to acquire the requested lock; otherwise, it waits. LOCK_NB can be ORed with either LOCK_SH or LOCK_EX by using Python's bitwise-or operator, the vertical bar (|).
For example:
import portalocker
afile = open("somefile", "r+")
portalocker.lock(afile, portalocker.LOCK_EX)
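And a sketch of a nonblocking attempt, ORing in LOCK_NB as just described (the exception type raised on failure differs between platforms, so this sketch catches broadly):

import portalocker
afile = open("somefile", "r+")
try:
    portalocker.lock(afile, portalocker.LOCK_EX | portalocker.LOCK_NB)
except Exception:
    print "somefile is locked by some other process"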
The implementation of the lock and unlock functions is entirely different on different systems. On Unix-like systems (including Linux and Mac OS X), the recipe relies on functionality made available by the standard fcntl module. On Windows systems (NT, 2000, XP—it doesn't work on old Win/95 and Win/98 platforms because they just don't have the needed oomph in the operating system!), the recipe uses the win32file module, part of the very popular PyWin32 package of Windows-specific extensions to Python, authored by Mark Hammond. But the important point is that, despite the differences in implementation, the functions (and the flags you can pass to the lock function) are made to behave in the same way across platforms. Such cross-platform packaging of differently implemented but equivalent functionality enables you to easily write cross-platform applications, which is one of Python's strengths.
When you write a cross-platform program, it's nice if the functionality that your program uses is, in turn, encapsulated in a cross-platform way. For file locking in particular, this is especially helpful to Perl users, who are used to an essentially transparent lock system call across platforms. More generally, tests such as if os.name == ... just do not belong in application-level code. Such platform testing ideally should always be in the standard library or an application-independent module, as it is here.
Documentation on the fcntl module in the Library Reference; documentation on the win32file module at http://ASPN.ActiveState.com/ASPN/Python/Reference/Products/ActivePython/PythonWin32Extensions/win32file.html; Jonathan Feinberg's web site (http://MrFeinberg.com).
Credit: Robin Parmar, Martin Miller
You want to make a backup copy of a file, before you overwrite it, with the standard convention of appending a three-digit version number to the name of the old file.
We just need to code a function to perform the backup copy appropriately:
def VersionFile(file_spec, vtype='copy'):
    import os, shutil
    if os.path.isfile(file_spec):
        # check the 'vtype' parameter
        if vtype not in ('copy', 'rename'):
            raise ValueError, 'Unknown vtype %r' % (vtype,)
        # Determine root filename so the extension doesn't get longer
        n, e = os.path.splitext(file_spec)
        # Is e a three-digit integer preceded by a dot?
        if len(e) == 4 and e[1:].isdigit():
            num = 1 + int(e[1:])
            root = n
        else:
            num = 0
            root = file_spec
        # Find next available file version
        for i in xrange(num, 1000):
            new_file = '%s.%03d' % (root, i)
            if not os.path.exists(new_file):
                if vtype == 'copy':
                    shutil.copy(file_spec, new_file)
                else:
                    os.rename(file_spec, new_file)
                return True
        raise RuntimeError, "Can't %s %r, all names taken" % (vtype, file_spec)
    return False

if __name__ == '__main__':
    import os
    # create a dummy file 'test.txt'
    tfn = 'test.txt'
    open(tfn, 'w').close()
    # version it 3 times
    print VersionFile(tfn)     # emits: True
    print VersionFile(tfn)     # emits: True
    print VersionFile(tfn)     # emits: True
    # remove all test.txt* files we just made
    for x in ('', '.000', '.001', '.002'):
        os.unlink(tfn + x)
    # show what happens when the file does not exist
    print VersionFile(tfn)     # emits: False
    print VersionFile(tfn)     # emits: False
The purpose of the VersionFile function is to ensure that an existing file is copied (or renamed, as indicated by the optional second parameter) before you open it for writing or updating and therefore modify it. It is polite to make such backups of files before you mangle them (one functionality some people still pine for from the good old VMS operating system, which performed it automatically!). The actual copy or renaming is performed by shutil.copy and os.rename, respectively, so the only issue is which name to use as the target.
A popular way to determine backups' names is versioning (i.e., appending to the filename a gradually incrementing number). This recipe determines the new name by first extracting the filename's root (just in case you call it with an already-versioned filename) and then successively appending to that root the further extensions .000, .001, and so on, until a name built in this manner does not correspond to any existing file. Then, and only then, is the name used as the target of a copy or renaming. Note that VersionFile is limited to 1,000 versions, so you should have an archive plan after that. The file must exist before it is first versioned—you cannot back up what does not yet exist. However, if the file doesn't exist, function VersionFile simply returns False (while it returns True if the file exists and has been successfully versioned), so you don't need to check before calling it!
Documentation for the os and shutil modules in the Library Reference and Python in a Nutshell.
Credit: Gian Paolo Ciceri
You need to ensure the integrity of some data by computing the data’s cyclic redundancy check (CRC), and you need to do so according to the CRC-64 specifications of the ISO-3309 standard.
The Python Standard Library does not include any implementation of CRC-64 (only one of CRC-32 in function zlib.crc32), so we need to program it ourselves. Fortunately, Python can perform bitwise operations (masking, shifting, bitwise-and, bitwise-or, xor, etc.) just as well as, say, C (and, in fact, with just about the same syntax), so it's easy to transliterate a typical reference implementation of CRC-64 into a Python function as follows:
# prepare two auxiliary tables (using a function, for speed),
# then remove the function, since it's not needed any more:
CRCTableh = [0] * 256
CRCTablel = [0] * 256

def _inittables(CRCTableh, CRCTablel, POLY64REVh, BIT_TOGGLE):
    for i in xrange(256):
        partl = i
        parth = 0L
        for j in xrange(8):
            rflag = partl & 1L
            partl >>= 1L
            if parth & 1:
                partl ^= BIT_TOGGLE
            parth >>= 1L
            if rflag:
                parth ^= POLY64REVh
        CRCTableh[i] = parth
        CRCTablel[i] = partl

# first 32 bits of generator polynomial for CRC64 (the 32 lower bits are
# assumed to be zero) and bit-toggle mask used in _inittables
POLY64REVh = 0xd8000000L
BIT_TOGGLE = 1L << 31L

# run the function to prepare the tables
_inittables(CRCTableh, CRCTablel, POLY64REVh, BIT_TOGGLE)

# remove all names we don't need any more, including the function
del _inittables, POLY64REVh, BIT_TOGGLE

# this module exposes the following two functions: crc64, crc64digest

def crc64(bytes, (crch, crcl)=(0, 0)):
    for byte in bytes:
        shr = (crch & 0xFF) << 24
        temp1h = crch >> 8L
        temp1l = (crcl >> 8L) | shr
        tableindex = (crcl ^ ord(byte)) & 0xFF
        crch = temp1h ^ CRCTableh[tableindex]
        crcl = temp1l ^ CRCTablel[tableindex]
    return crch, crcl

def crc64digest(aString):
    return "%08X%08X" % (crc64(aString))

if __name__ == '__main__':
    # a little test/demo, for when this module runs as main-script
    assert crc64("IHATEMATH") == (3822890454, 2600578513)
    assert crc64digest("IHATEMATH") == "E3DCADD69B01ADD1"
    print 'crc64: dumb test successful'
Cyclic redundancy checks (CRCs) are a popular way to ensure that data (in particular, a file) has not been accidentally damaged. CRCs can readily detect accidental damage, but they are not intended to withstand inimical assault the way other cryptographically strong checksums are. CRCs can be computed much faster than other kinds of checksums, making them useful in those cases where the only damage we need to guard against is accidental damage, rather than deliberate adversarial tampering.
Mathematically speaking, a CRC is computed as a polynomial over the bits of the data we’re checksumming. In practice, as this recipe shows, most of the computation can be done once and for all and summarized in tables that, when properly indexed, give the contribution of each byte of input data to the result. So, after initialization (which we do with an auxiliary function because computation in Python is much faster when using a function’s local variables than when using globals), actual CRC computation is quite fast. Both the computation of the tables and their use for CRC computation require a lot of bitwise operations, but, fortunately, Python’s just as good at such operations as other languages such as C. (In fact, Python’s syntax for the various bitwise operands is just about the same as C’s.)
The algorithm to compute the standard CRC-64 checksum is described in the ISO-3309 standard, and this recipe does nothing more than implement that algorithm. The generator polynomial is x^64 + x^4 + x^3 + x + 1. (The "See Also" section within this recipe provides a reference for obtaining information about the computation.)
We represent the 64-bit result as a pair of Python ints, holding the low and high 32-bit halves of the result. To allow the CRC to be computed incrementally, in those cases where the data comes in a little at a time, we let the caller of function crc64 optionally feed in the "initial value" for the (crch, crcl) pair, presumably obtained by calling crc64 on previous parts of the data. To compute the CRC in one gulp, the caller just needs to pass in the data (a string of bytes), since in this case, we initialize the result to (0, 0) by default.
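For instance, a sketch of such incremental use, assuming this recipe's code is saved as a module crc64.py (the filenames are hypothetical):

import crc64
crch, crcl = 0, 0
f = open('somefile', 'rb')
while True:
    chunk = f.read(8192)
    if not chunk:
        break
    # feed each partial result back in as the next call's initial value
    crch, crcl = crc64.crc64(chunk, (crch, crcl))
f.close()
print "%08X%08X" % (crch, crcl)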
W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C, 2d ed. (Cambridge University Press), pp. 896ff.