Credit: Paul F. Dubois, Ph.D., Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory
This chapter was originally meant to cover mainly topics such as lexing, parsing, and code generation—the classic issues of programs that are about programs. It turns out, however, that Pythonistas did not post many recipes about such tasks, focusing more on highly Python-specific topics such as program introspection, dynamic importing, and generation of functions by closure. Many of those recipes, we decided, were more properly located in various other chapters—on shortcuts, debugging, object-oriented programming, algorithms, metaprogramming, and specific areas such as the handling of text, files, and persistence. Therefore, you will find those topics covered in other chapters. In this chapter, we included only those recipes that are still best described as programs about programs. Of these, probably the most important is the one about currying, the creation of new functions by predetermining some arguments of other functions.
This arrangement doesn’t mean that the classic issues aren’t important! Python has extensive facilities related to lexing and parsing, as well as a large number of user-contributed modules related to parsing standard languages, which reduces the need for doing your own programming. If Pythonistas are not using these tools, then, in this one area, they are doing more work than they need to. Lexing and parsing are among the most common of programming tasks, and as a result, both are the subject of much theory and much prior development. Therefore, in these areas more than most, you will often profit if you take the time to search for solutions before resorting to writing your own. This Introduction contains a general guide to solving some common problems in these categories to encourage reusing the wide base of excellent, solid code and theory in these fields.
Lexing is the process of dividing an input stream into meaningful units, known as tokens, which are then processed. Lexing occurs in tasks such as data processing and in tools for inspecting and modifying text.
The regular expression facilities in Python are extensive and highly evolved, so your first consideration for a lexing task is often to determine whether it can be formulated using regular expressions. Also, see the next section about parsers for common languages and how to lex those languages.
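As a minimal sketch of this regex-first approach (the token classes and pattern here are invented for the example, not part of any standard), a few named groups can serve as a complete little lexer:

```python
import re

# One named group per token class; alternatives are tried in order.
token_pattern = re.compile(r'''
      (?P<NUMBER> \d+(?:\.\d*)? )
    | (?P<NAME>   [A-Za-z_]\w* )
    | (?P<OP>     [-+*/()=] )
    | (?P<SKIP>   \s+ )
''', re.VERBOSE)

def tokens(text):
    " Return (kind, text) pairs, discarding whitespace. "
    result = []
    for match in token_pattern.finditer(text):
        if match.lastgroup != 'SKIP':
            result.append((match.lastgroup, match.group()))
    return result
```

For instance, tokens('x = 3.5 + y') yields NAME, OP, NUMBER, OP, NAME pairs in order.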
The Python Standard Library tokenize
module splits an input stream into
Python-language tokens. Since Python’s tokenization rules are similar
to those of many other languages, this module may often be suitable
for other tasks, perhaps with a modest amount of pre- and/or
post-processing around tokenize
’s
own operations. For more complex tokenization tasks, Plex, http://nz.cosc.canterbury.ac.nz/~greg/python/Plex/,
can ease your efforts considerably.
At the other end of the lexing complexity spectrum, the built-in
string method split
can also be
used for many simple cases. For example, consider a file consisting of
colon-separated text fields, with one record per line. You can read a
line from the file as follows:
fields = line.split(':')
This produces a list of the fields. At this point, if you want to eliminate spurious whitespace at the beginning and ends of the fields, you can remove it:
fields = [f.strip( ) for f in fields]
For example:
>>> x = "abc :def:ghi : klm "
>>> fields = x.split(':')
>>> print fields
['abc ', 'def', 'ghi ', ' klm ']
>>> print [f.strip( ) for f in fields]
['abc', 'def', 'ghi', 'klm']
Do not elaborate on this example: do not try to over-enrich
simple code to perform lexing and parsing tasks which are in fact
quite hard to perform with generality, solidity, and good performance,
and for which much excellent, reusable code exists. For parsing
typical comma-separated values files, or files using other delimiters,
study the standard Python library module csv
. The ScientificPython package,
http://starship.python.net/~hinsen/ScientificPython/,
includes a module for reading and writing with Fortran-like formats,
and other such precious I/O modules, in the Scientific.IO
sub-package.
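For instance, csv.reader accepts any iterable of lines, so a quick sketch of delimiter-aware parsing (the sample data is invented for the example) shows what naive splitting would get wrong:

```python
import csv

# A quoted field containing the delimiter defeats a naive line.split(',')
sample = ['name,qty', 'spam,"3, or so"', 'eggs,12']
rows = list(csv.reader(sample))
```

Here rows is [['name', 'qty'], ['spam', '3, or so'], ['eggs', '12']]: the comma inside the quoted field survives intact.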
A common “gotcha” for beginners is that, while lexing and other
text-parsing techniques can be used to read numerical data from a
file, at the end of this stage, the entries are text strings, not
numbers. The int
and float
built-in functions are frequently
needed here, to turn each field from a string into a number:
>>> x = "1.2, 2.3, 4, 5.6"
>>> print [float(y.strip( )) for y in x.split(',')]
[1.2, 2.2999999999999998, 4.0, 5.5999999999999996]
Parsing refers to discovering semantic meaning from a series of tokens according to the rules of a grammar. Parsing tasks are quite ubiquitous. Programming tools may attempt to discover information about program texts or to modify such texts to fit a task. (Python’s introspection capabilities come into play here, as we will discuss later.) Little languages is the generic name given to application-specific languages that serve as human-readable forms of computer input. Such languages can vary from simple lists of commands and arguments to full-blown languages.
The grammar in the previous lexing example was implicit: the
data you need is organized as one line per record with the fields
separated by a special character. The “parser” in that case was
supplied by the programmer reading the lines from the file and
applying the simple split
method to
obtain the information. This sort of input file can easily grow,
leading to requests for a more elaborate form. For example, users may
wish to use comments, blank lines, conditional statements, or
alternate forms. While most such parsing can be handled with simple
logic, at some point, it becomes so complicated that it is much more
reliable to use a real grammar.
There is no hard-and-fast way to decide which part of the job is a lexing task and which belongs to the grammar. For example, comments can often be discarded in the lexing, but doing so is not wise in a program-transformation tool that must produce output containing the original comments.
Your strategy for parsing tasks can include:
Using a parser for that language from the Python Standard Library.
Using a parser from the user community. You can often find one by visiting the Vaults of Parnassus site, http://www.vex.net/parnassus/, or by searching the Python site, http://www.python.org.
Generating a parser using a parser generator.
Using Python itself as your input language.
A combination of approaches is often fruitful. For example, a simple parser can turn input into Python-language statements, which Python then executes in concert with a supporting package that you supply.
A number of parsers for specific languages exist in the standard library, and more are out there on the Web, supplied by the user community. In particular, the standard library includes parsing packages for XML, HTML, SGML, command-line arguments, configuration files, and for Python itself. For the now-ubiquitous task of parsing XML specifically, this cookbook includes a chapter, Chapter 14, specifically dedicated to XML.
You do not have to parse C to connect C routines to Python. Use SWIG (http://www.swig.org). Likewise, you do not need a Fortran parser to connect Fortran and Python. See the Numerical Python web page at http://www.pfdubois.com/numpy/ for further information. Again, this cookbook includes a chapter, Chapter 17, dedicated to these kinds of tasks.
PLY and SPARK are two rich, solid, and mature Python-based parser generators. That is, they take as their input some statements that describe the grammar to be parsed and generate the parser for you. To make a useful tool, you must add the semantic actions to be taken when a certain construct in the grammar is recognized.
PLY (http://systems.cs.uchicago.edu/ply) is a Python implementation of the popular Unix tool yacc. SPARK (http://pages.cpsc.ucalgary.ca/~aycock/spark/content.html) parses a more general set of grammars than yacc. Both tools use Python introspection, including the idea of placing grammar rules in functions’ docstrings.
Parser generators are one of the many application areas that may have even too many excellent tools, so that you may end up frustrated by having to pick just one. Besides SPARK and PLY, other Python tools in this field include TPG (Toy Parser Generator), DParser, PyParsing, kwParsing (or kyParsing), PyLR, Yapps, PyGgy, mx.TextTools and its SimpleParse frontend—too many to provide more than a bare mention of each, so, happy googling!
The chief problem in using any of these tools is that you need to educate yourself about grammars and learn to write them. A novice without any computer science background will encounter some difficulty except with very simple grammars. A lot of literature is available to teach you how to use yacc, and most of this knowledge will help you use SPARK and most of the others just as well.
If you are interested in this area, the definitive reference is Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools (Addison-Wesley), affectionately known as “the Dragon Book” to generations of computer science majors.
Python itself can be used to create many application-specific languages. By writing suitable classes, you can rapidly create a program that is easy to get running, yet is extensible later. Suppose I want a language to describe graphs. Nodes have names, and edges connect the nodes. I want a way to input such graphs, so that after reading the input I will have the data structures in Python that I need for any further processing. So, for example:
nodes = { }

def getnode(name):
    " Return the node with the given name, creating it if necessary. "
    if name not in nodes:
        nodes[name] = node(name)
    return nodes[name]

class node(object):
    " A node has a name and a list of edges emanating from it. "
    def __init__(self, name):
        self.name = name
        self.edgelist = [ ]

class edge(object):
    " An edge connects two nodes. "
    def __init__(self, name1, name2):
        self.nodes = getnode(name1), getnode(name2)
        for n in self.nodes:
            n.edgelist.append(self)
    def __repr__(self):
        return self.nodes[0].name + self.nodes[1].name
Using just these simple statements, I can now parse a list of
edges that describe a graph, and afterwards, I will have data
structures that contain all my information. Here, I enter a graph with
four edges and print the list of edges emanating from node 'A
':
>>> edge('A', 'B')
>>> edge('B', 'C')
>>> edge('C', 'D')
>>> edge('C', 'A')
>>> print getnode('A').edgelist
[AB, CA]
Suppose that I now want a weighted graph. I could easily add a
weight=1.0
default argument to the
edge constructor, and the old input would still work. Also, I could
easily add error-checking logic to ensure that edge lists have no
duplicates. Furthermore, I already have my node class and can start
adding logic to it for any needed processing purposes, be it directly
or by subclassing. I can easily turn the entries in the dictionary
nodes into similarly named variables that are bound to the node
objects. After adding a few more classes corresponding to other input
I need, I am well on my way.
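The weight=1.0 extension just described is a one-line change to the edge constructor; here is a sketch (condensed from the classes above, with the weight attribute as the only addition):

```python
nodes = {}

def getnode(name):
    " Return the node with the given name, creating it if necessary. "
    if name not in nodes:
        nodes[name] = node(name)
    return nodes[name]

class node(object):
    def __init__(self, name):
        self.name = name
        self.edgelist = []

class edge(object):
    # the weight=1.0 default keeps old, unweighted input files working
    def __init__(self, name1, name2, weight=1.0):
        self.nodes = getnode(name1), getnode(name2)
        self.weight = weight
        for n in self.nodes:
            n.edgelist.append(self)

edge('A', 'B')
edge('A', 'C', weight=2.5)
```

Old input such as edge('A', 'B') still works unchanged, while new input may specify a weight.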
The advantage to this approach is clear. For example, the following is already handled correctly:
edge('A', 'B')
if 'X' in nodes:
    edge('X', 'A')

def triangle(n1, n2, n3):
    edge(n1, n2)
    edge(n2, n3)
    edge(n3, n1)

triangle('A','W','K')

execfile('mygraph.txt')    # Read graph from a datafile
So I already have syntactic sugar, user-defined language extensions, and input from other files. The definitions usually go into a module, and the user simply imports them. Had I written my own language, instead of reusing Python in this little-language role, such accomplishments might be months away.
Python programs have the ability to examine themselves; this set of facilities comes under the general title of introspection. For example, a Python function object knows a lot about itself, including the names of its arguments, and the docstring that was given when it was defined:
>>> def f(a, b):
...     " Return the difference of a and b "
...     return a-b
...
>>> dir(f)
['__call__', '__class__', '__delattr__', '__dict__', '__doc__',
 '__get__', '__getattribute__', '__hash__', '__init__', '__module__',
 '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__setattr__', '__str__', 'func_closure', 'func_code', 'func_defaults',
 'func_dict', 'func_doc', 'func_globals', 'func_name']
>>> f.func_name
'f'
>>> f.func_doc
' Return the difference of a and b '
>>> f.func_code
<code object f at 0175DDF0, file "<pyshell#18>", line 1>
>>> dir(f.func_code)
['__class__', '__cmp__', '__delattr__', '__doc__',
 '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__',
 '__reduce_ex__', '__repr__', '__setattr__', '__str__', 'co_argcount',
 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno',
 'co_flags', 'co_freevars', 'co_lnotab', 'co_name', 'co_names',
 'co_nlocals', 'co_stacksize', 'co_varnames']
>>> f.func_code.co_varnames
('a', 'b')
SPARK and PLY make an interesting use of introspection. The grammar is entered as docstrings in the routines that take the semantic actions when those grammar constructs are recognized. (Hey, don’t turn your head all the way around like that! Introspection has its limits.)
Introspection is very popular in the Python community, and you
will find many examples of it in recipes in this book, both in this
chapter and elsewhere. Even in this field, though,
always remember the possibility of reuse!
Standard library module inspect
has
a lot of solid, reusable inspection-related code. It’s all pure Python
code, and you can (and should) study the inspect.py source file in your Python
library to see what “raw” facilities underlie inspect
’s elegant high-level
functions—indeed, this suggestion generalizes: studying the standard
library’s sources is among the best things you can do to increment
your Python knowledge and skill. But reusing the
standard library’s wealth of modules and packages is still best: any
code you don’t write is code you don’t have to maintain, and solid,
heavily tested code such as the code that you find in the standard
library is very likely to have far fewer bugs than any newly developed
code you might write yourself.
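For instance, inspect's high-level functions recover metadata far more conveniently than poking at func_doc and friends directly (the sample function here is invented for the example):

```python
import inspect

def sample(a, b=2):
    "Add a and b."
    return a + b

# getdoc returns the docstring with indentation cleaned up
doc = inspect.getdoc(sample)
# inspect can also classify objects: functions, methods, classes, modules...
is_fn = inspect.isfunction(sample)
```

Here doc is 'Add a and b.' and is_fn is True.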
Python is the most powerful language that you can still read. The kinds of tasks discussed in this chapter help to show just how versatile and powerful it really is.
Credit: Gyro Funch, Rogier Steehouder
You need to check whether a string read from a file or obtained from user input has a valid numeric format.
The simplest and most Pythonic approach is to “try and see”:
def is_a_number(s):
    try:
        float(s)
    except ValueError:
        return False
    else:
        return True
If you insist, you can also perform this task with a regular expression:
import re
num_re = re.compile(r'^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$')

def is_a_number(s):
    return bool(num_re.match(s))
Having a regular expression to start from may be best if you need to be tolerant of certain specific variations, or to pick up numeric substrings from the middle of larger strings. But for the specific task posed as this recipe’s Problem, it’s simplest and best to “let Python do it!”
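For instance (repeating the try-and-see version here so the snippet stands alone), typical inputs classify as you would expect:

```python
def is_a_number(s):
    try:
        float(s)
    except ValueError:
        return False
    else:
        return True

# float() accepts signs, decimals, and exponent notation...
good = [s for s in ('1.5', '-2', '+3e4', '.5') if is_a_number(s)]
# ...but rejects malformed or empty strings
bad = [s for s in ('spam', '1.2.3', '') if is_a_number(s)]
```

Here good keeps all four valid strings, and bad comes out empty.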
Documentation for the re
module and the float
built-in function in the Library Reference and
Python in a Nutshell.
Credit: Anders Hammarquist
You need to wrap code in either compiled or source form
in a module, possibly adding it to sys.modules
as well.
We build a new module object, optionally add it to sys.modules
, and populate it with an
exec
statement:
import new

def importCode(code, name, add_to_sys_modules=False):
    """ code can be any object containing code: a string, a file object, or
        a compiled code object.  Returns a new module object initialized
        by dynamically importing the given code, and optionally adds it to
        sys.modules under the given name.
    """
    module = new.module(name)
    if add_to_sys_modules:
        import sys
        sys.modules[name] = module
    exec code in module.__dict__
    return module
This recipe lets you import a module from code that is
dynamically generated or obtained. My original intent for it was to
import a module stored in a database, but it will work for modules
from any source. Thanks to the flexibility of the exec
statement, the
importCode
function can accept code in many forms: a
string of source (which gets implicitly compiled on the fly), a file
object (ditto), or a previously compiled code object.
The addition of the newly generated module to sys.modules
is optional. You shouldn’t
normally do so for such dynamically obtained code, but there are
exceptions—specifically, when import
statements for the module’s name are
later executed, and it’s important that they retrieve from sys.modules
your dynamically generated
module. If you want the sys.modules
addition, it’s best to perform it before the module’s code body
executes, just as normal import statements do, in case the code body
relies on that normal behavior (which it usually doesn’t, but it can’t
hurt to be prepared).
Note that the normal Python statement:
import foo
in simple cases (where no hooks, built-in modules, imports from zip files, etc., come into play!) is essentially equivalent to:
if 'foo' in sys.modules:
    foo = sys.modules['foo']
else:
    foofile = open("/path/to/foo.py")   # for some suitable /path/to/...
    foo = importCode(foofile, "foo", 1)
A toy example of using this recipe:
code = """
def testFunc( ):
    print "spam!"

class testClass(object):
    def testMethod(self):
        print "eggs!"
"""
m = importCode(code, "test")
m.testFunc( )
o = m.testClass( )
o.testMethod( )
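In more recent Python versions the new module is deprecated; the same idea can be sketched with types.ModuleType and the exec function (this variant is an assumption of mine, not part of the original Solution, though it follows the recipe's semantics):

```python
import sys, types

def import_code(code, name, add_to_sys_modules=False):
    " Same semantics as the recipe's importCode, minus the new module. "
    module = types.ModuleType(name)
    if add_to_sys_modules:
        sys.modules[name] = module
    # exec as a function works in Python 2.6+ and 3.x alike
    exec(code, module.__dict__)
    return module

m = import_code("def testFunc():\n    return 'spam!'\n", "test")
```

After this, m.testFunc() returns 'spam!' and m behaves like any imported module.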
Sections on the import
and
exec
statements in the
Language Reference; documentation on the
modules
attribute of the sys
standard library module and the new
module in the Library
Reference; Python in a Nutshell
sections about both the language and library aspects.
Credit: Jürgen Hermann
You need to import a name from a module, just as from
module
import
name
would do, but
module
and name
are runtime-computed expressions. This need often arises, for example,
when you want to support user-written plug-ins.
The __import__
built-in
function lets you perform this task:
def importName(modulename, name):
    """ Import a named object from a module in the context of this function.
    """
    try:
        module = __import__(modulename, globals( ), locals( ), [name])
    except ImportError:
        return None
    return getattr(module, name)
This recipe’s function lets you perform the equivalent of
from
module
import
name
, in
which either or both module
and
name
are dynamic values (i.e., expressions
or variables) rather than constant strings. For example, this
functionality can be used to implement a plug-in mechanism to extend
an application with external modules that adhere to a common
interface.
Some programmers’ instinctive reaction to this task would be to
use exec
, but this instinct would
be a pretty bad one. The exec
statement is too powerful, and therefore is a
last-ditch measure, to be used only when nothing else is available
(which is almost never). It’s just too easy to have horrid bugs and/or
security weaknesses where exec
is
used. In almost all cases, there are better ways. This recipe shows
one such way for an important problem.
For example, suppose you have, in a file named MyApp/extensions/spam.py, the following code:
class Handler(object):
    def handleSomething(self):
        print "spam!"
and, in a file named MyApp/extensions/eggs.py
:
class Handler(object):
    def handleSomething(self):
        print "eggs!"
We must also suppose that the MyApp
directory is in a directory on
sys.path
, and both it and the
extensions
subdirectory are
identified as Python packages (meaning that
each of them must contain a file, possibly empty, named __init__.py). Then, we can get and call
both implementations with the following code:
for extname in 'spam', 'eggs':
    HandlerClass = importName("MyApp.extensions." + extname, "Handler")
    handler = HandlerClass( )
    handler.handleSomething( )
It’s possible to remove the constraints about sys.path
and __init__.py, and dynamically import from anywhere, with the
imp
standard module. However,
imp
is substantially harder to use
than the __import__
built-in
function, and you can generally arrange things to avoid imp
’s greater generality and
difficulty.
The import
pattern
implemented by this recipe is used in MoinMoin (http://moin.sourceforge.net/)
to load extensions implementing variations of a common interface, such
as action
, macro
, and formatter
.
Documentation on the __import__
and getattr
built-ins
in the Library Reference and Python
in a Nutshell; MoinMoin is available at http://moin.sourceforge.net.
Credit: Scott David Daniels, Nick Perkins, Alex Martelli, Ben Wolfson, Alex Naanou, David Abrahams, Tracy Ruggles
You need to wrap a function (or other callable) to get another callable with fewer formal arguments, keeping given values fixed for the other arguments (i.e., you need to curry a callable to make another).
Curry is not just a delightful spice used in Asian cuisine—it’s also an important programming technique in Python and other languages:
def curry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(a+more_a), **dict(kw, **more_kw))
    return curried
Popular in functional programming, currying is a way to bind some of a function’s arguments and wait for the rest of them to show up later. Currying is named in honor of Haskell Curry, a mathematician who laid some of the cornerstones in the theory of formal systems and processes. Some pedants (and it must be grudgingly admitted they have a point) claim that the technique shown in this recipe should be called partial application, and that “currying” is something else. But whether they’re right or wrong, in a book whose title claims it’s a cookbook, the use of curry in a title was simply irresistible. Besides, the use of the verb to curry that this recipe supports is the most popular one among programmers.
The curry
function defined in this recipe is
invoked with a callable and some or all of the arguments to the
callable. (Some people like to refer to functions that accept function
objects as arguments, and return new function objects as results, as
higher-order functions.) The
curry
function returns a closure
curried
that takes subsequent parameters as
arguments and calls the original with all of those parameters. For
example:
import operator
double = curry(operator.mul, 2)
triple = curry(operator.mul, 3)
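Repeating the four-line curry from the Solution so the snippet is self-contained, the curried callables behave as you would expect:

```python
import operator

def curry(f, *a, **kw):
    # closure capturing the currying-time arguments a and kw
    def curried(*more_a, **more_kw):
        return f(*(a + more_a), **dict(kw, **more_kw))
    return curried

double = curry(operator.mul, 2)
triple = curry(operator.mul, 3)
```

Here double(7) is 14 and triple(7) is 21; keyword arguments work too, e.g. curry(int, base=2)('1010') is 10.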
To implement currying, the choice is among closures, classes with callable instances, and lambda forms. Closures are simplest and fastest, so that’s what we use in this recipe.
A typical use of curry
is to construct callback
functions for GUI operations. When the operation does not merit a new
function name, curry
can be useful in creating these
little functions. For example, this can be the case with commands for
Tkinter buttons:
self.button = Button(frame, text='A', command=curry(transcript.append, 'A'))
Recipe 11.2
shows a specialized subset of “curry” functionality intended to
produce callables that require no arguments, which are often needed
for such GUI-callback usage. However, this recipe’s
curry
function is vastly more flexible, without any
substantial extra cost in either complexity or performance.
Currying can also be used interactively to make versions of your functions with debugging-appropriate defaults, or initial parameters filled in for your current case. For example, database debugging work might begin by setting:
Connect = curry(ODBC.Connect, dsn='MyDataSet')
Another example of the use of curry
in debugging is to wrap
methods:
def report(originalFunction, name, *args, **kw):
    print "%s(%s)" % (name, ', '.join(map(repr, args) +
                      [k+'='+repr(kw[k]) for k in kw]))
    result = originalFunction(*args, **kw)
    if result:
        print name, '==>', result
    return result

class Sink(object):
    def write(self, text):
        pass

dest = Sink( )
dest.write = curry(report, dest.write, 'write')
print >>dest, 'this', 'is', 1, 'test'
If you are creating a function for regular use, the def fun
form of function definition is more
readable and more easily extended. As you can see from the
implementation, no magic happens to specialize the function with the
provided parameters. curry
should be used when you
feel the code is clearer with its use than without. Typically, this
use will emphasize that you are only providing some pre-fixed
parameters to a commonly used function, not providing any separate
processing.
Currying also works well in creating a “lightweight subclass”.
You can curry
the constructor of a class to give the
illusion of a subclass:
BlueWindow = curry(Window, background="blue")
BlueWindow.__class__
is still
Window
, not a subclass, but if you’re changing only
default parameters, not behavior, currying is arguably more
appropriate than subclassing anyway. And you can still pass additional
parameters to the curried constructor.
Two decisions must be made when coding a curry implementation, since both positional and keyword arguments can come in two “waves”: some at currying time, and more at call time. The two decisions are: do the call-time positional arguments go before or after the currying-time ones? And do the call-time keyword arguments override currying-time ones, or vice versa? If you study this recipe’s Solution, you can see I’ve made these decisions in a specific way (the one that is most generally useful): call-time positional arguments go after currying-time ones, and call-time keyword arguments override currying-time ones. In some circles, this is referred to as left-left partial application. It’s trivial to code other variations, such as right-left partial application:
def rcurry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(more_a+a), **dict(kw, **more_kw))
    return curried
As you can see, despite the grandiose-sounding terms, this is
just a matter of concatenating more_a+a
rather than the reverse; and
similarly, for keyword arguments, you just need to call dict(more_kw, **kw)
if you want
currying-time keyword arguments to override call-time ones rather than
vice versa.
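A subtraction example makes the left/right difference concrete (both functions repeated here so the snippet stands alone):

```python
def curry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(a + more_a), **dict(kw, **more_kw))
    return curried

def rcurry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(more_a + a), **dict(kw, **more_kw))
    return curried

def sub(x, y):
    return x - y

left = curry(sub, 10)(3)    # currying-time args first: sub(10, 3)
right = rcurry(sub, 10)(3)  # call-time args first:     sub(3, 10)
```

Here left is 7 while right is -7, since only the concatenation order of the positional arguments differs.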
If you wish, you could have the curried function carry a copy of
the original function’s docstring, or even (easy in Python 2.4, but
feasible, with a call to new.function
, even in 2.3—see the sidebar in
Recipe 20.1) a name
that is somehow derived from the original function. However, I have
chosen not to do so because the original name, and argument
descriptions in the docstring, are probably not appropriate
for the curried version. The task of constructing and
documenting the actual signature of the curried version is also
feasible (with a liberal application of the helper functions from
standard library module inspect
),
but it’s so disproportionate an effort, compared to
curry
’s delightfully simple four lines of code (!),
that I resolutely refuse to undertake it.
A special case, which may be worth keeping in mind, is when the
callable you want to curry is a Python function
(not a bound method, a C-coded function, a
callable class instance, etc.), and all you need
to curry is the first parameter. In this case,
the function object’s __get__
special method may be all you need. It takes an arbitrary argument and
returns a bound-method object with the first parameter bound to that
argument. For example:
>>> def f(adj, noun='world'):
...     return 'Goodbye, %s %s!' % (adj, noun)
...
>>> cf = f.__get__('cruel')
>>> print cf( )
Goodbye, cruel world!
>>> cf
<bound method ?.f of 'cruel'>
>>> type(cf)
<type 'instancemethod'>
>>> cf.im_func
<function f at 0x402dba04>
>>> cf.im_self
'cruel'
Recipe 11.2
shows a specialized subset of the curry functionality that is
specifically intended for GUI callbacks; docs for the inspect
module and the dict
built-in type in the Library
Reference and Python in a
Nutshell.
Credit: Scott David Daniels
You need to construct a new function by composing existing functions (i.e., each call of the new function must call one existing function on its arguments, then another on the result of the first one).
Composition is a fundamental operation between functions and
yields a new function as a result. The new function must call one
existing function on its arguments, then another on the result of the
first one. For example, a function that, given a string, returns a
copy that is lowercase and does not have leading and trailing blanks,
is the composition of the existing string.lower
and string.strip
functions. (In this case, it
does not matter in which order the two existing functions are applied,
but generally, it could be important.)
A closure (a nested function returned from another function) is often the best Pythonic approach to constructing new functions:
def compose(f, g, *args_for_f, **kwargs_for_f):
    ''' compose functions.  compose(f, g, x)(y) = f(g(y), x) '''
    def fg(*args_for_g, **kwargs_for_g):
        return f(g(*args_for_g, **kwargs_for_g), *args_for_f, **kwargs_for_f)
    return fg

def mcompose(f, g, *args_for_f, **kwargs_for_f):
    ''' compose functions.  mcompose(f, g, x)(y) = f(*g(y), x) '''
    def fg(*args_for_g, **kwargs_for_g):
        mid = g(*args_for_g, **kwargs_for_g)
        if not isinstance(mid, tuple):
            mid = (mid,)
        return f(*(mid+args_for_f), **kwargs_for_f)
    return fg
The closures in this recipe show two styles of function
composition. I separated mcompose
and compose
because I think of the
two possible forms of function composition as being quite different,
in mathematical terms. In practical terms, the difference shows only
when the second function being composed, g
, returns a
tuple. The closure returned by compose
passes the
result of g
as f
’s first argument
anyway, while the closure returned by mcompose
treats
it as a tuple of arguments to pass along. Any extra arguments provided
to either compose
or mcompose
are
treated as extra arguments for f
(there is no
standard functional behavior to follow here):
compose(f, g, x)(y)  = f(g(y), x)
mcompose(f, g, x)(y) = f(*g(y), x)
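For instance, the lowercase-and-strip function mentioned earlier comes out as a one-liner (compose repeated here so the snippet stands alone):

```python
def compose(f, g, *args_for_f, **kwargs_for_f):
    # g runs first on the call-time arguments, then f on g's result
    def fg(*args_for_g, **kwargs_for_g):
        return f(g(*args_for_g, **kwargs_for_g), *args_for_f, **kwargs_for_f)
    return fg

# strip(lower(s)): order is irrelevant here, but generally it matters
clean = compose(str.strip, str.lower)
```

Here clean('  Hello World  ') returns 'hello world'.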
As in currying (see Recipe 16.4), this recipe’s functions are for constructing functions from other functions. Your goal in so doing should be clarity, since no efficiency is gained by using these functional forms.
Here’s a quick example for interactive use:
parts = compose(' '.join, dir)
When called on a module object, the callable we just bound to
name parts
gives you an easy-to-view string
that lists the module’s contents.
Recipe 16.4 for an example of currying (i.e., associating parameters with partially evaluated functions).
Credit: Jürgen Hermann, Mike Brown
You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.
tokenize.generate_tokens
does
most of the work. We just need to loop over all tokens it finds, to
output them with appropriate colorization:
""" MoinMoin - Python Source Parser """
import cgi, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)
_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2
_colors = {
    token.NUMBER:     '#0080C0',
    token.OP:         '#0000C0',
    token.STRING:     '#004080',
    tokenize.COMMENT: '#008000',
    token.NAME:       '#000000',
    token.ERRORTOKEN: '#FF8080',
    _KEYWORD:         '#C00000',
    _TEXT:            '#000000',
}

class Parser(object):
    """ Send colorized Python source HTML to output file (normally stdout).
    """
    def __init__(self, raw, out=sys.stdout):
        """ Store the source text. """
        self.raw = raw.expandtabs( ).strip( )
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while True:
            pos = self.raw.find('\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))
        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida, Courier New">')
        try:
            for tok in tokenize.generate_tokens(text.readline):
                # unpack the components of each token
                toktype, toktext, (srow, scol), (erow, ecol), line = tok
                if False:  # You may enable this for debugging purposes only
                    print "type", toktype, token.tok_name[toktype],
                    print "text", toktext,
                    print "start", srow, scol, "end", erow, ecol, "<br>"
                # Calculate new positions
                oldpos = self.pos
                newpos = self.lines[srow] + scol
                self.pos = newpos + len(toktext)
                # Handle newlines
                if toktype in (token.NEWLINE, tokenize.NL):
                    self.out.write('\n')
                    continue
                # Send the original whitespace, if needed
                if newpos > oldpos:
                    self.out.write(self.raw[oldpos:newpos])
                # Skip indenting tokens, since they're whitespace-only
                if toktype in (token.INDENT, token.DEDENT):
                    self.pos = newpos
                    continue
                # Map token type to a color group
                if token.LPAR <= toktype <= token.OP:
                    toktype = token.OP
                elif toktype == token.NAME and keyword.iskeyword(toktext):
                    toktype = _KEYWORD
                color = _colors.get(toktype, _colors[_TEXT])
                style = ''
                if toktype == token.ERRORTOKEN:
                    style = ' style="border: solid 1.5pt #FF0000;"'
                # Send text
                self.out.write('<font color="%s"%s>' % (color, style))
                self.out.write(cgi.escape(toktext))
                self.out.write('</font>')
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

if __name__ == "__main__":
    print "Formatting..."
    # Open own source
    source = open('python.py').read( )
    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format( )
    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html")
This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to
use the built-in keyword
, token
, and tokenize
modules to scan Python source code
and re-emit it with appropriate color markup but no changes to its
original formatting (“no changes” is the hard part!).
The Parser
class’ constructor saves the
multiline string that is the Python source to colorize, and the file
object, which is open for writing, where you want to output the
colorized results. Then, the format
method prepares a
self.lines
list that holds the offset (i.e., the
index into the source string, self.raw
) of each
line’s start.
format then loops over the tokens yielded by the generator tokenize.generate_tokens, unpacking each
token tuple into items specifying the token type and starting and
ending positions in the source (each expressed as line number and
offset within the line). The body of the loop reconstructs the exact
position within the original source code string self.raw
, so it can emit exactly the same
whitespace that was present in the original source. It then picks a
color code from the _colors
dictionary (which uses
HTML color coding), with help from the keyword
standard module to determine whether
a NAME
token is actually a Python
keyword (to be output in a different color than that used for ordinary
identifiers).
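The same scanning machinery survives into modern Python: in Python 3, tokenize.generate_tokens still yields named five-item tuples, and keyword.iskeyword still distinguishes keywords from plain names. Here is a minimal Python 3 sketch (the function name classify is ours) that performs just the token-classification step of the recipe, without the HTML emission:

```python
import io
import keyword
import token
import tokenize

def classify(source):
    """Return (kind, text) pairs for each token of a Python source string,
    mapping NAME tokens that are keywords to the 'keyword' kind."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            kind = 'keyword'
        else:
            # e.g. 'name', 'op', 'number', 'comment', 'newline'
            kind = token.tok_name[tok.type].lower()
        pairs.append((kind, tok.string))
    return pairs

pairs = classify("if x:\n    pass  # done\n")
```

Hooking an HTML (or terminal-color) emitter onto these (kind, text) pairs recovers the recipe's behavior; the hard part, as noted above, remains reproducing the original whitespace exactly.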
The test code at the bottom of the module formats the module
itself and launches a browser with the result, using the standard
Python library module webbrowser
to
enable you to see and enjoy the result in your favorite
browser.
If you put this recipe’s code into a module, you can then import
the module and reuse its functionality in CGI scripts (using the
PATH_TRANSLATED
CGI environment
variable to know what file to colorize), command-line tools (taking
filenames as arguments), filters that colorize anything they get from
standard input, and so on. See http://skew.org/~mike/colorize.py for versions
that support several of these various possibilities.
With small changes, it’s also easy to turn this recipe into an Apache handler, so your Apache web site can serve colorized .py files. Specifically, if you set up this script as a handler in Apache, then the file is served up as colorized HTML whenever a visitor to the site requests a .py file.
For the purpose of using this recipe as an Apache handler, you need to save the script as colorize.cgi (not .py, lest it confuses Apache), and add, to your .htaccess or httpd.conf Apache configuration files, the following lines:
AddHandler application/x-python .py
Action application/x-python /full/virtual/path/to/colorize.cgi
Also, make sure you have Apache's mod_actions module (which supplies the Action directive) enabled in your httpd.conf Apache configuration file.
Documentation for the webbrowser
, token
, tokenize
, and keyword
modules in the Library
Reference and Python in a Nutshell;
the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer,
as part of MoinMoin (http://moin.sourceforge.net), and, in a
somewhat different variant, also at http://skew.org/~mike/colorize.py; the Apache
web server is available and documented at http://httpd.apache.org.
Credit: Peter Cogolo
You need to tokenize an input language whose tokens are almost the same as Python’s, with a few exceptions that need token merging and splitting.
Standard library module tokenize
is very handy; we need to wrap it
with a generator to do the post-processing for a little splitting and
merging of tokens. The merging requires the ability to “peek ahead” in
an iterator. We can get that ability by wrapping any iterator into a
small dedicated iterator class:
class peek_ahead(object):
    sentinel = object()
    def __init__(self, it):
        self._nit = iter(it).next
        self.preview = None
        self._step()
    def __iter__(self):
        return self
    def next(self):
        result = self._step()
        if result is self.sentinel:
            raise StopIteration
        else:
            return result
    def _step(self):
        result = self.preview
        try:
            self.preview = self._nit()
        except StopIteration:
            self.preview = self.sentinel
        return result
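For readers on Python 3, the same wrapper needs only the renamed iterator-protocol method (__next__ instead of next); a sketch, with the class respelled PeekAhead to keep it distinct from the 2.x original:

```python
class PeekAhead:
    """One-item look-ahead iterator wrapper (a Python 3 respelling of the
    recipe's peek_ahead): .preview always holds the upcoming item, or the
    class-level sentinel once the underlying iterator is exhausted."""
    sentinel = object()
    def __init__(self, it):
        self._nit = iter(it).__next__
        self.preview = None
        self._step()          # prime .preview with the first item
    def __iter__(self):
        return self
    def __next__(self):
        result = self._step()
        if result is self.sentinel:
            raise StopIteration
        return result
    def _step(self):
        # advance: return the current preview, fetch the next one
        result = self.preview
        try:
            self.preview = self._nit()
        except StopIteration:
            self.preview = self.sentinel
        return result

it = PeekAhead('abc')
```

After construction, it.preview is 'a'; calling next(it) returns 'a' and moves the preview to 'b', and so on.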
Armed with this tool, we can easily split and merge tokens. Say,
for example, by the rules of the language we’re lexing, that we must
consider each of ':=
' and ':+
' to be a single token, but a
floating-point token that is a '.
'
with digits on both sides, such as '31.17
', must be given as a sequence of three
tokens, '31
', '.
', '17
'
in this case. Here’s how (using Python 2.4 code with comments on how
to change it if you’re stuck with version 2.3):
import tokenize, cStringIO
# in 2.3, also do 'from sets import Set as set'
mergers = {':' : set('=+'), }
def tokens_of(x):
    it = peek_ahead(toktuple[1]
        for toktuple in tokenize.generate_tokens(cStringIO.StringIO(x).readline)
        )   # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # merge with next token, as required
            yield tok + it.next()
        elif tok[:1].isdigit() and '.' in tok:
            # split if digits on BOTH sides of the '.'
            before, after = tok.split('.', 1)
            if after:
                # both sides -> yield as 3 separate tokens
                yield before
                yield '.'
                yield after
            else:
                # nope -> yield as one token
                yield tok
        else:
            # not a merge or split case, just yield the token
            yield tok
Here’s an example of use of this recipe’s code:
>>> x = 'p{z:=23, w:+7}: m :+ 23.4'
>>> print ' / '.join(tokens_of(x))
p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /
In this recipe, I yield tokens only as substrings of the string
I’m lexing, rather than the whole tuple
yielded by tokenize.generate_tokens
, including such
items as token position within the overall string (by line and
column). If your needs are more sophisticated than mine, you should
simply peek_ahead
on whole token tuples (while I’m
simplifying things by picking up just the substring, item 1, out of
each token tuple, by passing to peek_ahead
a
generator expression), and compute start and end positions
appropriately when splitting or merging. For example, if you’re
merging two adjacent tokens, the overall token has the same start
position as the first, and the same end position as the second, of the
two tokens you’re merging.
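The merging-and-splitting logic itself does not depend on tokenize at all. The following self-contained Python 3 sketch (the function name merge_split is ours, and the token stream is supplied by hand rather than by tokenize) shows the same control flow, holding the one-token look-ahead in a local variable:

```python
# which follower tokens each token may merge with (assumption: same
# ':=' / ':+' rules as in the recipe's hypothetical input language)
MERGERS = {':': {'=', '+'}}

def merge_split(tokens, mergers=MERGERS):
    """Re-yield token substrings, merging ':'+'='/'+' pairs into one token
    and splitting floats with digits on both sides of the '.' into three."""
    it = iter(tokens)
    pending = next(it, None)          # one-token look-ahead
    while pending is not None:
        tok = pending
        pending = next(it, None)
        if pending is not None and pending in mergers.get(tok, ()):
            yield tok + pending       # merge with the next token
            pending = next(it, None)
        elif tok[:1].isdigit() and '.' in tok:
            before, after = tok.split('.', 1)
            if after:                 # digits on both sides: yield 3 tokens
                yield before
                yield '.'
                yield after
            else:
                yield tok
        else:
            yield tok

out = list(merge_split(['z', ':', '=', '23', ',', 'w', ':', '+', '23.4']))
```

The hand-supplied stream above mimics what tokenize would yield for 'z:=23, w:+23.4'; out comes back with ':=' and ':+' merged and '23.4' split in three.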
The peek_ahead
iterator wrapper class can often
be useful in many kinds of lexing and parsing tasks, exactly because
such tasks are well suited to operating on streams (which are well
represented by iterators) but often require a level of peek-ahead
and/or push-back ability. You can often get by with just one level; if
you need more than one level, consider having your wrapper hold a
container of peeked-ahead or pushed-back tokens. Python 2.4’s collections.deque
container implements a
double-ended queue, which is particularly well suited for such tasks.
For a more powerful look-ahead iterator wrapper, see Recipe 19.18.
Library Reference and Python
in a Nutshell sections on the Python Standard Library
modules tokenize
and cStringIO
; Recipe 19.18 for a more
powerful look-ahead iterator wrapper.
Credit: Peter Cogolo
You need to check whether a certain string has balanced parentheses, but regular expressions are not powerful enough for this task.
We want a “true” parser to check a string for balanced parentheses, since parsing theory proves that a regular expression is not sufficient. Choosing one out of the many Python parser generators, we’ll use David Beazley’s classic but evergreen PLY:
# define token names, and a regular expression per each token
tokens = 'OPEN_PAREN', 'CLOS_PAREN', 'OTHR_CHARS'
t_OPEN_PAREN = r'\('
t_CLOS_PAREN = r'\)'
t_OTHR_CHARS = r'[^()]+'    # RE meaning: one or more non-parentheses
def t_error(t):
    t.skip(1)
# make the lexer (AKA tokenizer)
import lex
lexer = lex.lex(optimize=1)
# define syntax action-functions, with syntax rules in docstrings
def p_balanced(p):
    ''' balanced : balanced OPEN_PAREN balanced CLOS_PAREN balanced
                 | OTHR_CHARS
                 |
    '''
    if len(p) == 1:
        p[0] = ''
    elif len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = p[1] + p[2] + p[3] + p[4] + p[5]
def p_error(p):
    pass
# make the parser (AKA scanner)
import yacc
parser = yacc.yacc()
def has_balanced_parentheses(s):
    if not s:
        return True
    result = parser.parse(s, lexer=lexer)
    return s == result
Here’s an example of use of this recipe’s code:
>>> s = 'ba(be, bi(bo, bu))'
>>> print s, has_balanced_parentheses(s)
ba(be, bi(bo, bu)) True
>>> s = 'ba(be, bi(bo), bu))'
>>> print s, has_balanced_parentheses(s)
ba(be, bi(bo), bu)) False
The first string has balanced parentheses, but the second one has an extra closed parenthesis; therefore, its parentheses are not balanced.
“How do I check a string for balanced parentheses?” is a frequently asked question about regular expressions. Programmers without a computer science background are often surprised to hear that regular expressions just aren’t powerful enough for this apparently simple task, and that a more complete form of grammar is required. (Perl’s regular expressions, augmented with the kitchen sink of arbitrary embedded expressions, do suffice—which just proves they aren’t anywhere near “regular” expressions any more!)
For this very simplified parsing problem we’re presenting, any
real parser is overkill—just loop over the string’s characters,
keeping a running count of the number of open and yet unclosed
parentheses encountered at this point, and return False
if the running count ever goes
negative or doesn’t go back down to exactly 0 at the end:
def has_bal_par(s):
    op = 0
    for c in s:
        if c == '(':
            op += 1
        elif c == ')':
            if op == 0:
                return False
            op -= 1
    return op == 0
However, using a parser when you need to parse is still a better
idea, in general, than hacking up special-purpose code such as this
has_bal_par
function. As soon as the problem gets
extended a bit (and problems invariably do grow,
in real life, in often unpredictable directions), a real parser can
grow gracefully and proportionally with the problem, while ad hoc code
often must be thrown away and completely rewritten.
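As a small taste of how the ad hoc approach has to grow when the problem does, here is a sketch of ours (not part of the recipe) that extends the single counter to a stack, so it can check several bracket kinds at once:

```python
def has_balanced_brackets(s, pairs={'(': ')', '[': ']', '{': '}'}):
    """Stack-based generalization of the recipe's single-counter check:
    each opener pushes its expected closer; each closer must match the
    top of the stack."""
    closers = set(pairs.values())
    stack = []
    for c in s:
        if c in pairs:
            stack.append(pairs[c])    # remember which closer we now expect
        elif c in closers:
            if not stack or stack.pop() != c:
                return False          # wrong closer, or closer with no opener
    return not stack                  # every opener must have been closed
```

Even this modest extension already forces a change of data structure (counter to stack); a grammar-based solution would instead grow by adding productions.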
All over the web, you can find oodles of Python packages that are suitable for lexing and parsing tasks. My favorite, out of all of them, is still good old PLY, David Beazley’s Python Lex-Yacc, which reproduces the familiar structure of the Unix commands lex and yacc while taking advantage of Python’s extra power when compared to the C language that those Unix commands support.
You can find PLY at http://systems.cs.uchicago.edu/ply/. PLY is a pure Python package: download it (as a .tgz compressed archive file), decompress and unarchive it (all reasonable archiving tools now support this subtask on all platforms), open a command shell, cd into the directory into which you unarchived PLY, and run the usual python setup.py install, with the proper privileges to be able to write into your Python installation’s site-packages directory (which privileges those are depends on how you installed Python, and on what platform you’re running). Briefly, install it just as you would install any other pure Python package.
As you can see from this recipe, PLY is quite easy to use, if
you know even just the fundamentals of lexing and parsing. First, you
define your grammar’s tokens—make a tuple or
list of all their names (conventionally uppercase) bound to name
tokens
at your module’s top level,
define for each token a regular expression bound to name t_
token_name
(again at the module’s top level), import lex
, and call lex.lex
to build your tokenizer (lexer).
Then, define your grammar’s action functions (each of them carries the
relevant syntax rule—production—in its
docstring in BNF, Backus-Naur Form), import yacc
, and call yacc.yacc
to build your parser (scanner). To
parse any string, call the parse
method of your parser with the string as an argument.
All the action is in your grammar’s action functions, as their
name implies. Each action function receives as its single argument
p
a list of production elements corresponding to the
production that has been matched to invoke that function; the action
function’s job is to put into p[0]
whatever you need as “the result” of that syntax rule getting matched.
In this recipe, we use as results the very strings we have been matching, so that function has_balanced_parentheses just needs to check whether the whole string is matched by the parse operation.
When you run this script the first time, you will see a warning about a shift/reduce conflict. Don’t worry: as any old hand at yacc can tell you, that’s the yacc equivalent of a rite of passage. If you want to understand that message in depth, and maybe (if you’re an ambitious person) even do something about it, open with your favorite browser the doc/ply.html file in the directory in which you unpacked PLY. That file contains a rather thorough documentation of PLY. As that file suggests, continue by studying the contents of the examples directory and then read a textbook about compilers—I suggest Dick Grune and Ceriel J.H. Jacobs, “Parsing Techniques, a Practical Guide.” The first edition, at the time of this writing, is freely available for download as a PDF file from http://www.cs.vu.nl/~dick/PTAPG.html, and a second edition should be available in technical bookstores around the middle of 2005.
PLY web page at http://systems.cs.uchicago.edu/ply/; Dick Grune and Ceriel J.H. Jacobs, “Parsing Techniques, a Practical Guide,” a PDF, downloadable from http://www.cs.vu.nl/~dick/PTAPG.html.
Credit: Will Ware
Python’s introspection facilities let you code a class that
implements a version of enum
, even
though Python, as a language, does not support the enum
construct:
class EnumException(Exception):
    pass

class Enumeration(object):
    def __init__(self, name, enumList, valuesAreUnique=True):
        self.__doc__ = name
        self.lookup = lookup = { }
        self.reverseLookup = reverseLookup = { }
        i = 0
        for x in enumList:
            if type(x) is tuple:
                try:
                    x, i = x
                except ValueError:
                    raise EnumException, "tuple doesn't have 2 items: %r" % (x,)
            if type(x) is not str:
                raise EnumException, "enum name is not a string: %r" % (x,)
            if type(i) is not int:
                raise EnumException, "enum value is not an integer: %r" % (i,)
            if x in lookup:
                raise EnumException, "enum name is not unique: %r" % (x,)
            if valuesAreUnique and i in reverseLookup:
                raise EnumException, "enum value %r not unique for %r" % (i, x)
            lookup[x] = i
            reverseLookup[i] = x
            i = i + 1
    def __getattr__(self, attr):
        try:
            return self.lookup[attr]
        except KeyError:
            raise AttributeError, attr
    def whatis(self, value):
        return self.reverseLookup[value]
In the C language, enum
lets
you declare several named constants, typically with unique values
(although you can also explicitly arrange for a value to be duplicated
under two different names), without necessarily specifying the actual
values (except when you want it to). Despite the similarity in naming,
C’s enum
and this recipe’s
Enumeration
class have little to do with the Python
built-in enumerate
generator, which
is used to loop on (
index
, item
)
pairs given an iterable—an
entirely different issue!
Python has an accepted idiom that’s fine for small numbers of constants:
A, B, C, D = range(4)
However, this idiom doesn’t scale well to large numbers of
constants and doesn’t allow you to specify values for some constants
while leaving others to be determined automatically. This recipe
provides for all these niceties and, optionally, also checks that all
values (both the ones explicitly specified and the ones automatically
determined) are unique. Enum values are attributes of an
Enumeration
class instance
(Volkswagen.BEETLE
,
Volkswagen.PASSAT
, etc.). A further feature, missing
in C but really quite useful, is the ability to go from the value to
the corresponding name inside the enumeration (the name you get can be
somewhat arbitrary for those enumerations in which you don’t constrain
values to be unique).
This recipe’s Enumeration
class has an
initializer that accepts a string argument to specify the
enumeration’s name and a sequence argument to specify the names of all
values in the enumeration. Each item of the sequence argument can be a
string (to specify that the value named is one more than the last
value used) or else a tuple with two items (the string that is the
value’s name, then the value itself, which must be an integer). The
code in this recipe relies heavily on strict type checking to
determine which case applies, but the recipe’s essence would not
change by much if the checking was performed in a more lenient way
(e.g., with the isinstance
built-in
function).
Each Enumeration
instance has two dict
attributes:
self.lookup to map names to values and self.reverseLookup to map values back to the corresponding names. The special method __getattr__ lets you use names with attribute syntax
(e.x
is mapped to e.lookup['x']
), and the
whatis
method allows reverse lookups (i.e., find a
name given a value) with similar ease.
Here’s an example of how you can use this
Enumeration
class:
if __name__ == '__main__':
    import pprint
    Volkswagen = Enumeration("Volkswagen",
        ("JETTA", "RABBIT", "BEETLE", ("THING", 400), "PASSAT", "GOLF",
         ("CABRIO", 700), "EURO_VAN", "CLASSIC_BEETLE", "CLASSIC_VAN"
        ))
    Insect = Enumeration("Insect",
        ("ANT", "APHID", "BEE", "BEETLE", "BUTTERFLY", "MOTH", "HOUSEFLY",
         "WASP", "CICADA", "GRASSHOPPER", "COCKROACH", "DRAGONFLY"
        ))
    def whatkind(value, enum):
        return enum.__doc__ + "." + enum.whatis(value)
    class ThingWithKind(object):
        def __init__(self, kind):
            self.kind = kind
    car = ThingWithKind(Volkswagen.BEETLE)
    print whatkind(car.kind, Volkswagen)    # emits Volkswagen.BEETLE
    bug = ThingWithKind(Insect.BEETLE)
    print whatkind(bug.kind, Insect)        # emits Insect.BEETLE
    print car.__dict__                      # emits {'kind': 2}
    print bug.__dict__                      # emits {'kind': 3}
    pprint.pprint(Volkswagen.__dict__)
    pprint.pprint(Insect.__dict__)
    # emits dozens of lines showing off lookup and reverseLookup dictionaries
Note that the attributes of car
and
bug
don’t include any of the enum
machinery because that machinery is
held as class attributes, not as instance attributes. This means you
can generate thousands of car
and
bug
objects with reckless abandon, never
worrying about wasting time or memory on redundant copies of the
enum
stuff.
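On Python 3, the same idea needs only cosmetic changes (function-call raise syntax, and here the more lenient isinstance checks the recipe itself suggests as an alternative); a compact sketch of ours:

```python
class Enumeration:
    """Python 3 sketch of the recipe's Enumeration, with isinstance-based
    (lenient) checks instead of strict type() tests."""
    def __init__(self, name, enum_list, values_are_unique=True):
        self.__doc__ = name
        self.lookup = {}
        self.reverseLookup = {}
        i = 0
        for x in enum_list:
            if isinstance(x, tuple):
                x, i = x              # explicit (name, value) pair
            if not isinstance(x, str):
                raise ValueError('enum name is not a string: %r' % (x,))
            if x in self.lookup:
                raise ValueError('enum name is not unique: %r' % (x,))
            if values_are_unique and i in self.reverseLookup:
                raise ValueError('enum value %r not unique for %r' % (i, x))
            self.lookup[x] = i
            self.reverseLookup[i] = x
            i += 1                    # next value is one more than the last
    def __getattr__(self, attr):
        try:
            return self.lookup[attr]
        except KeyError:
            raise AttributeError(attr)
    def whatis(self, value):
        return self.reverseLookup[value]

vw = Enumeration("Volkswagen", ("JETTA", "RABBIT", ("THING", 400), "PASSAT"))
```

With this, vw.JETTA is 0, vw.THING is 400, and vw.PASSAT continues from there at 401, just as in the recipe's 2.x version.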
Recipe 6.2 shows
how to define constants in Python; documentation on the special method
__getattr__
in the
Language Reference and Python in a
Nutshell.
Credit: Chris Perkins
You want to refer, from inside a list comprehension, to the same list object you’re building. However, the object being built by the list comprehension doesn’t have a name while you’re building it.
Internally, the Python interpreter does create a “secret” name
that exists only while a list comprehension is being built. In Python
2.3, that name is usually '_[1]
'
and refers to the bound method append
of the list object we’re building. We
can use this secret name as a back door to refer to the list object as
it gets built. For example, say we want to build a copy of a list but
without duplicates:
>>> L = [1, 2, 2, 3, 3, 3]
>>> [x for x in L if x not in locals()['_[1]'].__self__]
[1, 2, 3]
Python 2.4 uses the same name to indicate the list object being built, rather than the bound-method access. In the case of nested list
comprehensions, inner ones are named '_[2]
', '_[3]
', and so on, as nesting goes deeper.
Clearly, all of these considerations are best wrapped up into a
function:
import inspect
import sys
version_23 = sys.version_info < (2, 4)
def this_list():
    d = inspect.currentframe(1).f_locals
    nestlevel = 1
    while '_[%d]' % nestlevel in d:
        nestlevel += 1
    result = d['_[%d]' % (nestlevel - 1)]
    if version_23:
        return result.__self__
    else:
        return result
Using this function, we can make the preceding snippet more readable, as well as making it work properly in Python 2.4 as well as in version 2.3:
>>> [x for x in L if x not in this_list()]
[1, 2, 3]
List comprehensions may look a little like magic, but the
bytecode that Python generates for them is in fact quite mundane:
create an empty list, give the empty list’s bound method append a temporary name in the locals dictionary, append items one at a time, and then delete the name. All
of this happens, conceptually, between the open square bracket (
[
) and the close square bracket (
]), which enclose the list comprehension.
The temporary name that Python 2.3 assigns to the bound append
method is '_[1]
' (or '_[2]
', etc., for nested list
comprehensions). This name is deliberately chosen (to avoid accidental
clashes) to not be a syntactically valid Python
identifier, so we cannot refer to the bound method directly, by name.
However, we can access it as locals( )['_[1]']
. Once we have a reference
to the bound method object, we just use the bound method’s _ _self_ _
attribute to get at the list
object itself. In Python 2.4, the same name refers directly to the
list object, rather than to its bound method, so we skip the last
step.
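You can watch this machinery at work in any CPython version by disassembling a function that contains a list comprehension. In modern CPython, the append shows up as a dedicated LIST_APPEND opcode rather than a temporarily named bound method, which is one reason this recipe's trick does not carry over to Python 3. A quick sketch:

```python
import dis
import io

def make_list():
    # an ordinary list comprehension, used only as disassembly fodder
    return [x * 2 for x in range(3)]

# capture the disassembly listing as a string instead of printing it
buf = io.StringIO()
dis.dis(make_list, file=buf)
listing = buf.getvalue()
```

Searching the listing for LIST_APPEND shows the opcode that modern CPython uses where Python 2.3 used the secretly named bound append method.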
Having a reference to the list object enables us to do all sorts
of neat party tricks, such as performing if
tests that involve looking at the items
that have already been added to the list, or even modifying or
deleting them. These capabilities are just what the doctor ordered for
finding primes in a “one-liner”, for example: for each odd number, we
need to test whether it is divisible by any prime number less than or
equal to the square root of the number being tested. Since we already
have all the smaller primes stored and, with our new parlor trick,
have access to them, this test is a breeze and requires no auxiliary
storage:
import itertools
def primes_less_than(N):
    return [p for p in itertools.chain([2], xrange(3, N, 2))
            if 0 not in itertools.imap(
                lambda x: p % x,
                itertools.takewhile(
                    lambda v: v*v <= p, this_list()))]
The list comprehension that’s the whole body of this function
primes_less_than
, while long enough not to fit into a
single physical line, is all in a single logical
line (indeed, it must be, since any list comprehension is a single
expression), and therefore qualifies as a “one-liner” if you squint in
just the right way.
This simple prime-finding algorithm is nowhere near as fast as
the Sieve of Eratosthenes shown in Recipe 18.10, but the
ability to fit the entire algorithm inside a single expression is
nevertheless kind of neat. Part of its neatness comes from the
just-in-time evaluation that the functions from standard library
module itertools
perform so
nicely.
Alas, this neat trick definitely cannot be
recommended for production code. While it works in Python 2.3 and 2.4,
it could easily break in future releases, since it depends on
undocumented internals; for the same reason, it’s unlikely to work
properly on other implementations of the Python language, such as
Jython or IronPython. So, I suggest you use it to impress friends, but
for any real work, stick to clearer, faster, and solid good old
for
loops!
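For contrast, here is what the same algorithm looks like as one of those solid good old loops, in a Python 3 sketch of ours; with the result list bound to an ordinary name, no introspection trick is needed at all:

```python
def primes_less_than_loop(N):
    """Explicit-loop equivalent of the one-liner primes_less_than: keep the
    primes found so far in a plain, named list and test each candidate
    against the primes no larger than its square root."""
    primes = []
    for p in [2] + list(range(3, N, 2)):
        # p is prime if no smaller prime q with q*q <= p divides it
        if all(p % q for q in primes if q * q <= p):
            primes.append(p)
    return primes
```

The filtering `if q * q <= p` inside the generator expression plays the role that itertools.takewhile plays in the one-liner (at the cost of scanning the whole primes list rather than stopping early).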
Documentation for bound methods, list
s’ append
method, and the itertools
module in the Library
Reference and Python in a
Nutshell.
Credit: Alexander Semenov
You often use py2exe
to build Windows .exe files from
Python scripts, but you don’t want to bother writing a setup.py build script for each and every
such script.
distutils
is a package in the standard Python library, ready to
be imported from your Python code. py2exe
is a third-party extension to
distutils
for the specific task of
generating Windows executables from Python code: you must download and
install py2exe
separately, but once
installed, it cooperates smoothly with the standard distutils
. Thanks to these features, you can
easily write Python scripts to automate distutils
tasks (including py2exe
tasks). For example:
from distutils.core import setup
import sys, os, py2exe

# the key trick with our arguments and Python's sys.path
name = sys.argv[1]
sys.argv[1] = 'py2exe'
sys.path.append(os.path.dirname(os.path.abspath(name)))
setup(name=name[:-3], scripts=[name])
Save this as makexe.py in
the Tools\Scripts folder of your
Python installation. (You should always add this folder to your
Windows PATH
because it contains
many useful tools.) Now, from a Windows command prompt, you’re able to
cd
to a directory where you have
placed a script (say C:\MyDir),
and there run, say:
C:\MyDir> makexe.py myscript.py
and (assuming that you have a myscript.py script there, and .py among your Windows executable
extensions, with association to the Python interpreter) py2exe
prepares all the files you need for
distributing your masterpiece (as a Windows executable and supporting
DLLs), neatly arranged in folder C:\MyDir\dist\myscript.
The distutils
package is part
of the Python Standard Library. It helps you prepare your Python
modules and extensions for distribution, as well as letting you
install such packages as distributed by others. py2exe
is a freely downloadable third-party
extension that works on top of distutils
to help you build a Windows
.exe file (and a set of
supporting DLL files) from a Python-coded program, so that you can
distribute your program in executable form to other Windows PCs that
may not have Python installed; see http://starship.python.net/crew/theller/py2exe/,
both to download py2exe
and for
detailed documentation of this useful tool.
Following the details given in the distutils
(and py2exe
) documentation, the canonical way to
use distutils
(including py2exe
) is to write a script, conventionally
always named setup.py, to perform
all kinds of distutils
tasks on
your package. Normally, you write a setup.py for each package you distribute,
placing it in the top directory of the package (known as the
distribution root in distutils
terminology).
However, there is nothing mandatory about the convention of
writing a setup.py script per
package. distutils
and py2exe
, after all, are written as modules to
be imported from Python. So, you can, if you so choose, use all the
power of Python to code scripts that help you perform distutils
and py2exe
tasks in whatever ways you find most
convenient.
This recipe shows how I eliminate the need to write a separate
setup.py script for each Python
script that I convert to an executable with py2exe
, and related issues such as the need
to keep such scripts in dedicated “distribution root” directories. I
suggest you name this recipe’s script makexe.py, but any name will do, as long as
you avoid naming it py2exe.py (a
natural enough temptation). (Naming it py2exe.py would break the script because
the script must import py2exe
, and
if you named the script py2exe.py
it would “import itself” instead!)
Place this script on any directory on your Windows PATH
where you normally keep executable
Python scripts. I suggest you use the Tools\Scripts folder of the Python
distribution, a folder that contains several other useful scripts
you’ll want to have handy (have a look in that folder—it’s worth your
time). I’m not going to delve into the details of how to set and
examine your Windows PATH
, open a
command prompt, make your Python scripts executable, and so on. Such
system administration details differ slightly on each version of
Windows, and you’ll need to master them for any Windows version on
which you want to perform significant programming, anyway.
Once you have implemented this Solution, you’ll find that making your Python scripts into Windows executables has become so easy and so fast that soon you’ll be distributing your neat and useful programs to friends and acquaintances right and left. You won’t need to convince them to install the Python runtime files before they can install and run your programs, either! (Of course, in this way they will end up with what amounts to several copies of the runtime files, if they install several of your compiled programs—there is little you can do about that.)
The section “Distributing Python Modules” of the standard Python
documentation set is still incomplete but a good source of information
on the distutils
package;
Python in a Nutshell covers the essentials of
the distutils
package; py2exe
is at http://starship.python.net/crew/theller/py2exe/.
Credit: Joerg Raedler
You have a Python application composed of a main script and some additional modules. You want to bind the script and modules into one executable file, so that no installation procedure is necessary.
Prepare the following mixed sh/Python script and save it as file zipheader.unix:
#!/bin/sh
PYTHON=$(which python 2>/dev/null)
if [ ! -x "$PYTHON" ] ; then
    echo "python executable not found - cannot continue!"
    exit 1
fi
exec $PYTHON - $0 $@ << END_OF_PYTHON_CODE
import sys
version = sys.version_info[:2]
if version < (2, 3):
    print 'Sorry, need Python 2.3 or better; %s.%s is too old!' % version
    sys.exit(1)
sys.path.insert(0, sys.argv[1])
del sys.argv[0:2]
import main
main.main()
END_OF_PYTHON_CODE
Make sure you have the Python bytecode files for the main script
of your application (file main.pyc, containing a function named
main
, which starts the application when called
without arguments) and any additional modules your application needs
(e.g., files spam.pyc and
eggs.pyc). Make a zip file out of
them all:
$ zip myapp.zip main.pyc spam.pyc eggs.pyc
(If you prefer, you can build the zip file with an auxiliary Python program, of course.) Next, concatenate the “header” and the zip file, and make the resulting file executable:
$ cat zipheader.unix myapp.zip > myapp
$ chmod +x myapp
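The auxiliary Python program mentioned above might look like this sketch (the function name and defaults are ours): it writes the header, appends the zip archive to the same file using zipfile's append mode (which starts a fresh archive after non-zip leading data), and sets the executable bits:

```python
import os
import stat
import zipfile

def build_selfrunning(header='zipheader.unix', out='myapp',
                      members=('main.pyc', 'spam.pyc', 'eggs.pyc')):
    """Concatenate the sh/Python header with a zip of the bytecode files,
    then mark the result executable (one-step version of cat + zip + chmod)."""
    with open(out, 'w+b') as f:
        with open(header, 'rb') as h:
            f.write(h.read())
        # mode 'a' on a file whose contents are not a zip archive simply
        # appends a new archive after the existing data -- exactly what
        # the cat command above achieves
        with zipfile.ZipFile(f, 'a') as z:
            for name in members:
                z.write(name)
    mode = os.stat(out).st_mode
    os.chmod(out, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Python's zip import machinery (like zipfile itself) tolerates the leading non-zip data, which is what makes the whole scheme work.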
That’s all! Your application is now contained in this executable file myapp. When myapp runs, the shell /bin/sh sets things up and replaces itself with the Python interpreter. The Python interpreter reopens the file as a zip file, skipping the “header” text, and finds all needed modules in the zip file itself.
On Windows machines, you would normally use py2exe for similar tasks, as shown previously in Recipe 16.11; on Mac OS X, you would normally use py2app (although this recipe works just as well on Mac OS X as it does on any other Unix).
This recipe is particularly useful for Linux and other Unix
variants that come with Python installed. By following the steps
outlined in this recipe’s Solution, you can distribute a Python
application as a single, self-contained standalone executable file,
which runs on any version of Unix, on any hardware platform—as long as
your Python application does not need any C-coded extension modules
beyond the ones that come with Python itself. When you do need more,
you can use Python’s own distutils
package to perform more complicated packaging tasks. But for many
simple Python applications and quite a few that aren’t all that
simple, this recipe can be very useful, since it results in a file
that can just be run as is, without needing any
kind of “installation” step!
The key idea of this recipe is to exploit Python’s ability to import modules from a zip file, while skipping leading text that may precede the zip file itself. Here, as leading text, we use a small shell script that turns itself into a Python script, and within the same file continues with the zip file from which everything gets imported. The concept of importing from a zip file is described in Recipe 2.9.
In the zip file, you may, if you wish, place Python source files (with extension .py), as well as compiled bytecode files (with extension .pyc); the latter option is often preferable because if you zip up source files, Python compiles them every time you run the application, slowing your application’s startup. On the other hand, if you zip up compiled bytecode files, your application may be unable to run with versions of Python that are newer than the one you used to prepare the bytecode files, since binary compatibility of bytecode files is not guaranteed across Python releases. The best approach may be to place both sources and bytecodes in the zip file.
You may also choose to zip up optimized
bytecode files (with extension .pyo)—if you do so, you need to add the
flag -O
right after the $PYTHON
in the shell script in this recipe’s
Solution. Execution speed doesn’t generally change much, but optimized
execution skips assert
statements,
which may be important to you. Also, if you prepare the .pyo files by running Python with option
-OO
, all docstrings are eliminated,
which may slightly reduce your application’s size on disk (although
docstrings tend to compress well, so that size advantage may be
minor).
If you need help in finding all the modules that you need to
place in the zip file, see the modulefinder
module in the Python Standard
Library. Unfortunately, no real documentation about it is available at
the time of this writing, but just running (in version 2.3) something
like:
$ python /usr/lib/python2.3/modulefinder.py main.py
should help (you may have to change the path to the modulefinder.py script, depending on your Python installation). With Python 2.4, you can just use the handy new -m switch:
$ python -m modulefinder main.py
Python 2.4’s -m switch lets
you run as the main script any module that’s on Python’s sys.path
—a very convenient little
feature!
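modulefinder is also usable directly as a library, which can be handy when automating the zip-building step from a script of your own; a minimal sketch (the helper name is ours):

```python
from modulefinder import ModuleFinder

def modules_used_by(script_path):
    """Return the sorted names of every module modulefinder sees the given
    script import (the script itself appears as '__main__')."""
    finder = ModuleFinder()
    finder.run_script(script_path)
    return sorted(finder.modules)
```

Feeding the result (minus the standard-library names you expect on the target machine) to your zip-building step gives you the list of files to bundle.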
Recipe 16.11;
Recipe 2.9; the
sources of modules modulefinder
and
zipimport
(which are not yet
documented in the Library Reference at the time
of writing).
[1] I’d even call this book the ultimate reference, were it not for the fact that Donald Knuth continues to promise that the fifth volume (current ETA, the year 2010) of his epoch-making The Art of Computer Programming will be about this very subject.