Credit: Paul F. Dubois, Ph.D., Program for Climate Model Diagnosis and Intercomparison, Lawrence Livermore National Laboratory
This chapter was originally meant to cover mainly topics such as lexing, parsing, and code generation—the classic issues of programs that are about programs. It turns out, however, that Pythonistas did not post many recipes about such tasks, focusing more on highly Python-specific topics such as program introspection, dynamic importing, and generation of functions by closure. Many of those recipes, we decided, were more properly located in various other chapters—on shortcuts, debugging, object-oriented programming, algorithms, metaprogramming, and specific areas such as the handling of text, files, and persistence. Therefore, you will find those topics covered in other chapters. In this chapter, we included only those recipes that are still best described as programs about programs. Of these, probably the most important is the one about currying, the creation of new functions by predetermining some arguments of other functions.
This arrangement doesn’t mean that the classic issues aren’t important! Python has extensive facilities related to lexing and parsing, as well as a large number of user-contributed modules related to parsing standard languages, which reduces the need for doing your own programming. If Pythonistas are not using these tools, then, in this one area, they are doing more work than they need to. Lexing and parsing are among the most common of programming tasks, and as a result, both are the subject of much theory and much prior development. Therefore, in these areas more than most, you will often profit if you take the time to search for solutions before resorting to writing your own. This Introduction contains a general guide to solving some common problems in these categories to encourage reusing the wide base of excellent, solid code and theory in these fields.
Lexing is the process of dividing an input stream into meaningful units, known as tokens, which are then processed. Lexing occurs in tasks such as data processing and in tools for inspecting and modifying text.
The regular expression facilities in Python are extensive and highly evolved, so your first consideration for a lexing task is often to determine whether it can be formulated using regular expressions. Also, see the next section about parsers for common languages and how to lex those languages.
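As a minimal sketch of this regex-first approach (the token classes and pattern here are invented for the example, not part of any standard), a few named groups can serve as a complete little lexer:

```python
import re

# One named group per token class; alternatives are tried in order.
token_pattern = re.compile(r'''
      (?P<NUMBER> \d+(?:\.\d*)? )
    | (?P<NAME>   [A-Za-z_]\w* )
    | (?P<OP>     [-+*/()=] )
    | (?P<SKIP>   \s+ )
''', re.VERBOSE)

def tokens(text):
    " Return (kind, text) pairs, discarding whitespace. "
    result = []
    for match in token_pattern.finditer(text):
        if match.lastgroup != 'SKIP':
            result.append((match.lastgroup, match.group()))
    return result
```

For instance, tokens('x = 3.5 + y') yields NAME, OP, NUMBER, OP, NAME pairs in order.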
The Python Standard Library tokenize
module splits an input stream into
Python-language tokens. Since Python’s tokenization rules are similar
to those of many other languages, this module may often be suitable
for other tasks, perhaps with a modest amount of pre- and/or
post-processing around tokenize
’s
own operations. For more complex tokenization tasks, Plex, http://nz.cosc.canterbury.ac.nz/~greg/python/Plex/,
can ease your efforts considerably.
At the other end of the lexing complexity spectrum, the built-in
string method split
can also be
used for many simple cases. For example, consider a file consisting of
colon-separated text fields, with one record per line. You can read a
line from the file as follows:
fields = line.split(':')
This produces a list of the fields. At this point, if you want to eliminate spurious whitespace at the beginning and ends of the fields, you can remove it:
fields = [f.strip( ) for f in fields]
For example:
>>> x = "abc :def:ghi : klm "
>>> fields = x.split(':')
>>> print fields
['abc ', 'def', 'ghi ', ' klm ']
>>> print [f.strip( ) for f in fields]
['abc', 'def', 'ghi', 'klm']
Do not elaborate on this example: do not try to over-enrich
simple code to perform lexing and parsing tasks which are in fact
quite hard to perform with generality, solidity, and good performance,
and for which much excellent, reusable code exists. For parsing
typical comma-separated values files, or files using other delimiters,
study the standard Python library module csv
. The ScientificPython package,
http://starship.python.net/~hinsen/ScientificPython/,
includes a module for reading and writing with Fortran-like formats,
and other such precious I/O modules, in the Scientific.IO
sub-package.
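For instance, csv.reader accepts any iterable of lines, so a quick sketch of delimiter-aware parsing (the sample data is invented for the example) shows what naive splitting would get wrong:

```python
import csv

# A quoted field containing the delimiter defeats a naive line.split(',')
sample = ['name,qty', 'spam,"3, or so"', 'eggs,12']
rows = list(csv.reader(sample))
```

Here rows is [['name', 'qty'], ['spam', '3, or so'], ['eggs', '12']]: the comma inside the quoted field survives intact.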
A common “gotcha” for beginners is that, while lexing and other
text-parsing techniques can be used to read numerical data from a
file, at the end of this stage, the entries are text strings, not
numbers. The int
and float
built-in functions are frequently
needed here, to turn each field from a string into a number:
>>> x = "1.2, 2.3, 4, 5.6"
>>> print [float(y.strip( )) for y in x.split(',')]
[1.2, 2.2999999999999998, 4.0, 5.5999999999999996]
Parsing refers to discovering semantic meaning from a series of tokens according to the rules of a grammar. Parsing tasks are quite ubiquitous. Programming tools may attempt to discover information about program texts or to modify such texts to fit a task. (Python’s introspection capabilities come into play here, as we will discuss later.) Little languages is the generic name given to application-specific languages that serve as human-readable forms of computer input. Such languages can vary from simple lists of commands and arguments to full-blown languages.
The grammar in the previous lexing example was implicit: the
data you need is organized as one line per record with the fields
separated by a special character. The “parser” in that case was
supplied by the programmer reading the lines from the file and
applying the simple split
method to
obtain the information. This sort of input file can easily grow,
leading to requests for a more elaborate form. For example, users may
wish to use comments, blank lines, conditional statements, or
alternate forms. While most such parsing can be handled with simple
logic, at some point, it becomes so complicated that it is much more
reliable to use a real grammar.
There is no hard-and-fast way to decide which part of the job is a lexing task and which belongs to the grammar. For example, comments can often be discarded in the lexing, but doing so is not wise in a program-transformation tool that must produce output containing the original comments.
Your strategy for parsing tasks can include:
Using a parser for that language from the Python Standard Library.
Using a parser from the user community. You can often find one by visiting the Vaults of Parnassus site, http://www.vex.net/parnassus/, or by searching the Python site, http://www.python.org.
Generating a parser using a parser generator.
Using Python itself as your input language.
A combination of approaches is often fruitful. For example, a simple parser can turn input into Python-language statements, which Python then executes in concert with a supporting package that you supply.
A number of parsers for specific languages exist in the standard library, and more are out there on the Web, supplied by the user community. In particular, the standard library includes parsing packages for XML, HTML, SGML, command-line arguments, configuration files, and for Python itself. For the now-ubiquitous task of parsing XML specifically, this cookbook includes a chapter, Chapter 14, specifically dedicated to XML.
You do not have to parse C to connect C routines to Python. Use SWIG (http://www.swig.org). Likewise, you do not need a Fortran parser to connect Fortran and Python. See the Numerical Python web page at http://www.pfdubois.com/numpy/ for further information. Again, this cookbook includes a chapter, Chapter 17, dedicated to these kinds of tasks.
PLY and SPARK are two rich, solid, and mature Python-based parser generators. That is, they take as their input some statements that describe the grammar to be parsed and generate the parser for you. To make a useful tool, you must add the semantic actions to be taken when a certain construct in the grammar is recognized.
PLY (http://systems.cs.uchicago.edu/ply) is a Python implementation of the popular Unix tool yacc. SPARK (http://pages.cpsc.ucalgary.ca/~aycock/spark/content.html) parses a more general set of grammars than yacc. Both tools use Python introspection, including the idea of placing grammar rules in functions’ docstrings.
Parser generators are one of the many application areas that may have even too many excellent tools, so that you may end up frustrated by having to pick just one. Besides SPARK and PLY, other Python tools in this field include TPG (Toy Parser Generator), DParser, PyParsing, kwParsing (or kyParsing), PyLR, Yapps, PyGgy, mx.TextTools and its SimpleParse frontend—too many to provide more than a bare mention of each, so, happy googling!
The chief problem in using any of these tools is that you need to educate yourself about grammars and learn to write them. A novice without any computer science background will encounter some difficulty except with very simple grammars. A lot of literature is available to teach you how to use yacc, and most of this knowledge will help you use SPARK and most of the others just as well.
If you are interested in this area, the definitive reference is Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools (Addison-Wesley), affectionately known as “the Dragon Book” to generations of computer science majors.
Python itself can be used to create many application-specific languages. By writing suitable classes, you can rapidly create a program that is easy to get running, yet is extensible later. Suppose I want a language to describe graphs. Nodes have names, and edges connect the nodes. I want a way to input such graphs, so that after reading the input I will have the data structures in Python that I need for any further processing. So, for example:
nodes = { }

def getnode(name):
    " Return the node with the given name, creating it if necessary. "
    if name not in nodes:
        nodes[name] = node(name)
    return nodes[name]

class node(object):
    " A node has a name and a list of edges emanating from it. "
    def __init__(self, name):
        self.name = name
        self.edgelist = [ ]

class edge(object):
    " An edge connects two nodes. "
    def __init__(self, name1, name2):
        self.nodes = getnode(name1), getnode(name2)
        for n in self.nodes:
            n.edgelist.append(self)
    def __repr__(self):
        return self.nodes[0].name + self.nodes[1].name
Using just these simple statements, I can now parse a list of
edges that describe a graph, and afterwards, I will have data
structures that contain all my information. Here, I enter a graph with
four edges and print the list of edges emanating from node 'A
':
>>> edge('A', 'B')
>>> edge('B', 'C')
>>> edge('C', 'D')
>>> edge('C', 'A')
>>> print getnode('A').edgelist
[AB, CA]
Suppose that I now want a weighted graph. I could easily add a
weight=1.0
default argument to the
edge constructor, and the old input would still work. Also, I could
easily add error-checking logic to ensure that edge lists have no
duplicates. Furthermore, I already have my node class and can start
adding logic to it for any needed processing purposes, be it directly
or by subclassing. I can easily turn the entries in the dictionary
nodes into similarly named variables that are bound to the node
objects. After adding a few more classes corresponding to other input
I need, I am well on my way.
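The weight=1.0 extension just described is a one-line change to the edge constructor; here is a sketch (condensed from the classes above, with the weight attribute as the only addition):

```python
nodes = {}

def getnode(name):
    " Return the node with the given name, creating it if necessary. "
    if name not in nodes:
        nodes[name] = node(name)
    return nodes[name]

class node(object):
    def __init__(self, name):
        self.name = name
        self.edgelist = []

class edge(object):
    # the weight=1.0 default keeps old, unweighted input files working
    def __init__(self, name1, name2, weight=1.0):
        self.nodes = getnode(name1), getnode(name2)
        self.weight = weight
        for n in self.nodes:
            n.edgelist.append(self)

edge('A', 'B')
edge('A', 'C', weight=2.5)
```

Old input such as edge('A', 'B') still works unchanged, while new input may specify a weight.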
The advantage to this approach is clear. For example, the following is already handled correctly:
edge('A', 'B')
if 'X' in nodes:
    edge('X', 'A')

def triangle(n1, n2, n3):
    edge(n1, n2)
    edge(n2, n3)
    edge(n3, n1)

triangle('A','W','K')

execfile('mygraph.txt')    # Read graph from a datafile
So I already have syntactic sugar, user-defined language extensions, and input from other files. The definitions usually go into a module, and the user simply imports them. Had I written my own language, instead of reusing Python in this little-language role, such accomplishments might be months away.
Python programs have the ability to examine themselves; this set of facilities comes under the general title of introspection. For example, a Python function object knows a lot about itself, including the names of its arguments, and the docstring that was given when it was defined:
>>> def f(a, b):
...     " Return the difference of a and b "
...     return a-b
...
>>> dir(f)
['__call__', '__class__', '__delattr__', '__dict__', '__doc__',
 '__get__', '__getattribute__', '__hash__', '__init__', '__module__',
 '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__setattr__', '__str__', 'func_closure', 'func_code', 'func_defaults',
 'func_dict', 'func_doc', 'func_globals', 'func_name']
>>> f.func_name
'f'
>>> f.func_doc
' Return the difference of a and b '
>>> f.func_code
<code object f at 0175DDF0, file "<pyshell#18>", line 1>
>>> dir(f.func_code)
['__class__', '__cmp__', '__delattr__', '__doc__',
 '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__',
 '__reduce_ex__', '__repr__', '__setattr__', '__str__', 'co_argcount',
 'co_cellvars', 'co_code', 'co_consts', 'co_filename', 'co_firstlineno',
 'co_flags', 'co_freevars', 'co_lnotab', 'co_name', 'co_names',
 'co_nlocals', 'co_stacksize', 'co_varnames']
>>> f.func_code.co_varnames
('a', 'b')
SPARK and PLY make an interesting use of introspection. The grammar is entered as docstrings in the routines that take the semantic actions when those grammar constructs are recognized. (Hey, don’t turn your head all the way around like that! Introspection has its limits.)
Introspection is very popular in the Python community, and you
will find many examples of it in recipes in this book, both in this
chapter and elsewhere. Even in this field, though,
always remember the possibility of reuse!
Standard library module inspect
has
a lot of solid, reusable inspection-related code. It’s all pure Python
code, and you can (and should) study the inspect.py source file in your Python
library to see what “raw” facilities underlie inspect
’s elegant high-level
functions—indeed, this suggestion generalizes: studying the standard
library’s sources is among the best things you can do to increment
your Python knowledge and skill. But reusing the
standard library’s wealth of modules and packages is still best: any
code you don’t write is code you don’t have to maintain, and solid,
heavily tested code such as the code that you find in the standard
library is very likely to have far fewer bugs than any newly developed
code you might write yourself.
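For instance, inspect's high-level functions recover metadata far more conveniently than poking at func_doc and friends directly (the sample function here is invented for the example):

```python
import inspect

def sample(a, b=2):
    "Add a and b."
    return a + b

# getdoc returns the docstring with indentation cleaned up
doc = inspect.getdoc(sample)
# inspect can also classify objects: functions, methods, classes, modules...
is_fn = inspect.isfunction(sample)
```

Here doc is 'Add a and b.' and is_fn is True.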
Python is the most powerful language that you can still read. The kinds of tasks discussed in this chapter help to show just how versatile and powerful it really is.
Credit: Gyro Funch, Rogier Steehouder
You need to check whether a string read from a file or obtained from user input has a valid numeric format.
The simplest and most Pythonic approach is to “try and see”:
def is_a_number(s):
    try:
        float(s)
    except ValueError:
        return False
    else:
        return True
If you insist, you can also perform this task with a regular expression:
import re
num_re = re.compile(r'^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$')

def is_a_number(s):
    return bool(num_re.match(s))
Having a regular expression to start from may be best if you need to be tolerant of certain specific variations, or to pick up numeric substrings from the middle of larger strings. But for the specific task posed as this recipe’s Problem, it’s simplest and best to “let Python do it!”
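For instance (repeating the try-and-see version here so the snippet stands alone), typical inputs classify as you would expect:

```python
def is_a_number(s):
    try:
        float(s)
    except ValueError:
        return False
    else:
        return True

# float() accepts signs, decimals, and exponent notation...
good = [s for s in ('1.5', '-2', '+3e4', '.5') if is_a_number(s)]
# ...but rejects malformed or empty strings
bad = [s for s in ('spam', '1.2.3', '') if is_a_number(s)]
```

Here good keeps all four valid strings, and bad comes out empty.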
Documentation for the re
module and the float
built-in function in the Library Reference and
Python in a Nutshell.
Credit: Anders Hammarquist
You need to wrap code in either compiled or source form
in a module, possibly adding it to sys.modules
as well.
We build a new module object, optionally add it to sys.modules
, and populate it with an
exec
statement:
import new

def importCode(code, name, add_to_sys_modules=False):
    """ code can be any object containing code: a string, a file object, or
        a compiled code object.  Returns a new module object initialized
        by dynamically importing the given code, and optionally adds it to
        sys.modules under the given name.
    """
    module = new.module(name)
    if add_to_sys_modules:
        import sys
        sys.modules[name] = module
    exec code in module.__dict__
    return module
This recipe lets you import a module from code that is
dynamically generated or obtained. My original intent for it was to
import a module stored in a database, but it will work for modules
from any source. Thanks to the flexibility of the exec
statement, the
importCode
function can accept code in many forms: a
string of source (which gets implicitly compiled on the fly), a file
object (ditto), or a previously compiled code object.
The addition of the newly generated module to sys.modules
is optional. You shouldn’t
normally do so for such dynamically obtained code, but there are
exceptions—specifically, when import
statements for the module’s name are
later executed, and it’s important that they retrieve from sys.modules
your dynamically generated
module. If you want the sys.modules
addition, it’s best to perform it before the module’s code body
executes, just as normal import statements do, in case the code body
relies on that normal behavior (which it usually doesn’t, but it can’t
hurt to be prepared).
Note that the normal Python statement:
import foo
in simple cases (where no hooks, built-in modules, imports from zip files, etc., come into play!) is essentially equivalent to:
if 'foo' in sys.modules:
    foo = sys.modules['foo']
else:
    foofile = open("/path/to/foo.py")   # for some suitable /path/to/...
    foo = importCode(foofile, "foo", 1)
A toy example of using this recipe:
code = """
def testFunc( ):
    print "spam!"

class testClass(object):
    def testMethod(self):
        print "eggs!"
"""
m = importCode(code, "test")
m.testFunc( )
o = m.testClass( )
o.testMethod( )
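In more recent Python versions the new module is deprecated; the same idea can be sketched with types.ModuleType and the exec function (this variant is an assumption of mine, not part of the original Solution, though it follows the recipe's semantics):

```python
import sys, types

def import_code(code, name, add_to_sys_modules=False):
    " Same semantics as the recipe's importCode, minus the new module. "
    module = types.ModuleType(name)
    if add_to_sys_modules:
        sys.modules[name] = module
    # exec as a function works in Python 2.6+ and 3.x alike
    exec(code, module.__dict__)
    return module

m = import_code("def testFunc():\n    return 'spam!'\n", "test")
```

After this, m.testFunc() returns 'spam!' and m behaves like any imported module.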
Sections on the import
and
exec
statements in the
Language Reference; documentation on the
modules
attribute of the sys
standard library module and the new
module in the Library
Reference; Python in a Nutshell
sections about both the language and library aspects.
Credit: Jürgen Hermann
You need to import a name from a module, just as from
module
import
name
would do, but
module
and name
are runtime-computed expressions. This need often arises, for example,
when you want to support user-written plug-ins.
The __import__
built-in
function lets you perform this task:
def importName(modulename, name):
    """ Import a named object from a module in the context of this function.
    """
    try:
        module = __import__(modulename, globals( ), locals( ), [name])
    except ImportError:
        return None
    return getattr(module, name)
This recipe’s function lets you perform the equivalent of
from
module
import
name
, in
which either or both module
and
name
are dynamic values (i.e., expressions
or variables) rather than constant strings. For example, this
functionality can be used to implement a plug-in mechanism to extend
an application with external modules that adhere to a common
interface.
Some programmers’ instinctive reaction to this task would be to
use exec
, but this instinct would
be a pretty bad one. The exec
statement is too powerful, and therefore is a
last-ditch measure, to be used only when nothing else is available
(which is almost never). It’s just too easy to have horrid bugs and/or
security weaknesses where exec
is
used. In almost all cases, there are better ways. This recipe shows
one such way for an important problem.
For example, suppose you have, in a file named MyApp/extensions/spam.py, the following code:
class Handler(object):
    def handleSomething(self):
        print "spam!"
and, in a file named MyApp/extensions/eggs.py
:
class Handler(object):
    def handleSomething(self):
        print "eggs!"
We must also suppose that the MyApp
directory is in a directory on
sys.path
, and both it and the
extensions
subdirectory are
identified as Python packages (meaning that
each of them must contain a file, possibly empty, named __init__.py). Then, we can get and call
both implementations with the following code:
for extname in 'spam', 'eggs':
    HandlerClass = importName("MyApp.extensions." + extname, "Handler")
    handler = HandlerClass( )
    handler.handleSomething( )
It’s possible to remove the constraints about sys.path
and __init__.py, and dynamically import from anywhere, with the
imp
standard module. However,
imp
is substantially harder to use
than the __import__
built-in
function, and you can generally arrange things to avoid imp
’s greater generality and
difficulty.
The import
pattern
implemented by this recipe is used in MoinMoin (http://moin.sourceforge.net/)
to load extensions implementing variations of a common interface, such
as action
, macro
, and formatter
.
Documentation on the __import__
and getattr
built-ins
in the Library Reference and Python
in a Nutshell; MoinMoin is available at http://moin.sourceforge.net.
Credit: Scott David Daniels, Nick Perkins, Alex Martelli, Ben Wolfson, Alex Naanou, David Abrahams, Tracy Ruggles
You need to wrap a function (or other callable) to get another callable with fewer formal arguments, keeping given values fixed for the other arguments (i.e., you need to curry a callable to make another).
Curry is not just a delightful spice used in Asian cuisine—it’s also an important programming technique in Python and other languages:
def curry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(a+more_a), **dict(kw, **more_kw))
    return curried
Popular in functional programming, currying is a way to bind some of a function’s arguments and wait for the rest of them to show up later. Currying is named in honor of Haskell Curry, a mathematician who laid some of the cornerstones in the theory of formal systems and processes. Some pedants (and it must be grudgingly admitted they have a point) claim that the technique shown in this recipe should be called partial application, and that “currying” is something else. But whether they’re right or wrong, in a book whose title claims it’s a cookbook, the use of curry in a title was simply irresistible. Besides, the use of the verb to curry that this recipe supports is the most popular one among programmers.
The curry
function defined in this recipe is
invoked with a callable and some or all of the arguments to the
callable. (Some people like to refer to functions that accept function
objects as arguments, and return new function objects as results, as
higher-order functions.) The
curry
function returns a closure
curried
that takes subsequent parameters as
arguments and calls the original with all of those parameters. For
example:
import operator
double = curry(operator.mul, 2)
triple = curry(operator.mul, 3)
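Repeating the four-line curry from the Solution so the snippet is self-contained, the curried callables behave as you would expect:

```python
import operator

def curry(f, *a, **kw):
    # closure capturing the currying-time arguments a and kw
    def curried(*more_a, **more_kw):
        return f(*(a + more_a), **dict(kw, **more_kw))
    return curried

double = curry(operator.mul, 2)
triple = curry(operator.mul, 3)
```

Here double(7) is 14 and triple(7) is 21; keyword arguments work too, e.g. curry(int, base=2)('1010') is 10.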
To implement currying, the choice is among closures, classes with callable instances, and lambda forms. Closures are simplest and fastest, so that’s what we use in this recipe.
A typical use of curry
is to construct callback
functions for GUI operations. When the operation does not merit a new
function name, curry
can be useful in creating these
little functions. For example, this can be the case with commands for
Tkinter buttons:
self.button = Button(frame, text='A', command=curry(transcript.append, 'A'))
Recipe 11.2
shows a specialized subset of “curry” functionality intended to
produce callables that require no arguments, which are often needed
for such GUI-callback usage. However, this recipe’s
curry
function is vastly more flexible, without any
substantial extra cost in either complexity or performance.
Currying can also be used interactively to make versions of your functions with debugging-appropriate defaults, or initial parameters filled in for your current case. For example, database debugging work might begin by setting:
Connect = curry(ODBC.Connect, dsn='MyDataSet')
Another example of the use of curry
in debugging is to wrap
methods:
def report(originalFunction, name, *args, **kw):
    print "%s(%s)" % (name, ', '.join(map(repr, args) +
                      [k+'='+repr(kw[k]) for k in kw]))
    result = originalFunction(*args, **kw)
    if result:
        print name, '==>', result
    return result

class Sink(object):
    def write(self, text):
        pass

dest = Sink( )
dest.write = curry(report, dest.write, 'write')
print >>dest, 'this', 'is', 1, 'test'
If you are creating a function for regular use, the def fun
form of function definition is more
readable and more easily extended. As you can see from the
implementation, no magic happens to specialize the function with the
provided parameters. curry
should be used when you
feel the code is clearer with its use than without. Typically, this
use will emphasize that you are only providing some pre-fixed
parameters to a commonly used function, not providing any separate
processing.
Currying also works well in creating a “lightweight subclass”.
You can curry
the constructor of a class to give the
illusion of a subclass:
BlueWindow = curry(Window, background="blue")
BlueWindow.__class__
is still
Window
, not a subclass, but if you’re changing only
default parameters, not behavior, currying is arguably more
appropriate than subclassing anyway. And you can still pass additional
parameters to the curried constructor.
Two decisions must be made when coding a curry implementation, since both positional and keyword arguments can come in two “waves”: some at currying time, and more at call time. The two decisions are: do the call-time positional arguments go before or after the currying-time ones? And do the call-time keyword arguments override currying-time ones, or vice versa? If you study this recipe’s Solution, you can see I’ve made these decisions in a specific way (the one that is most generally useful): call-time positional arguments go after currying-time ones, and call-time keyword arguments override currying-time ones. In some circles, this is referred to as left-left partial application. It’s trivial to code other variations, such as right-left partial application:
def rcurry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(more_a+a), **dict(kw, **more_kw))
    return curried
As you can see, despite the grandiose-sounding terms, this is
just a matter of concatenating more_a+a
rather than the reverse; and
similarly, for keyword arguments, you just need to call dict(more_kw, **kw)
if you want
currying-time keyword arguments to override call-time ones rather than
vice versa.
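A subtraction example makes the left/right difference concrete (both functions repeated here so the snippet stands alone):

```python
def curry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(a + more_a), **dict(kw, **more_kw))
    return curried

def rcurry(f, *a, **kw):
    def curried(*more_a, **more_kw):
        return f(*(more_a + a), **dict(kw, **more_kw))
    return curried

def sub(x, y):
    return x - y

left = curry(sub, 10)(3)    # currying-time args first: sub(10, 3)
right = rcurry(sub, 10)(3)  # call-time args first:     sub(3, 10)
```

Here left is 7 while right is -7, since only the concatenation order of the positional arguments differs.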
If you wish, you could have the curried function carry a copy of
the original function’s docstring, or even (easy in Python 2.4, but
feasible, with a call to new.function
, even in 2.3—see the sidebar in
Recipe 20.1) a name
that is somehow derived from the original function. However, I have
chosen not to do so because the original name, and argument
descriptions in the docstring, are probably not appropriate
for the curried version. The task of constructing and
documenting the actual signature of the curried version is also
feasible (with a liberal application of the helper functions from
standard library module inspect
),
but it’s so disproportionate an effort, compared to
curry
’s delightfully simple four lines of code (!),
that I resolutely refuse to undertake it.
A special case, which may be worth keeping in mind, is when the
callable you want to curry is a Python function
(not a bound method, a C-coded function, a
callable class instance, etc.), and all you need
to curry is the first parameter. In this case,
the function object’s __get__
special method may be all you need. It takes an arbitrary argument and
returns a bound-method object with the first parameter bound to that
argument. For example:
>>> def f(adj, noun='world'):
...     return 'Goodbye, %s %s!' % (adj, noun)
...
>>> cf = f.__get__('cruel')
>>> print cf( )
Goodbye, cruel world!
>>> cf
<bound method ?.f of 'cruel'>
>>> type(cf)
<type 'instancemethod'>
>>> cf.im_func
<function f at 0x402dba04>
>>> cf.im_self
'cruel'
Recipe 11.2
shows a specialized subset of the curry functionality that is
specifically intended for GUI callbacks; docs for the inspect
module and the dict
built-in type in the Library
Reference and Python in a
Nutshell.
Credit: Scott David Daniels
You need to construct a new function by composing existing functions (i.e., each call of the new function must call one existing function on its arguments, then another on the result of the first one).
Composition is a fundamental operation between functions and
yields a new function as a result. The new function must call one
existing function on its arguments, then another on the result of the
first one. For example, a function that, given a string, returns a
copy that is lowercase and does not have leading and trailing blanks,
is the composition of the existing string.lower
and string.strip
functions. (In this case, it
does not matter in which order the two existing functions are applied,
but generally, it could be important.)
A closure (a nested function returned from another function) is often the best Pythonic approach to constructing new functions:
def compose(f, g, *args_for_f, **kwargs_for_f):
    ''' compose functions.  compose(f, g, x)(y) = f(g(y), x) '''
    def fg(*args_for_g, **kwargs_for_g):
        return f(g(*args_for_g, **kwargs_for_g), *args_for_f, **kwargs_for_f)
    return fg

def mcompose(f, g, *args_for_f, **kwargs_for_f):
    ''' compose functions.  mcompose(f, g, x)(y) = f(*g(y), x) '''
    def fg(*args_for_g, **kwargs_for_g):
        mid = g(*args_for_g, **kwargs_for_g)
        if not isinstance(mid, tuple):
            mid = (mid,)
        return f(*(mid+args_for_f), **kwargs_for_f)
    return fg
The closures in this recipe show two styles of function
composition. I separated mcompose
and compose
because I think of the
two possible forms of function composition as being quite different,
in mathematical terms. In practical terms, the difference shows only
when the second function being composed, g
, returns a
tuple. The closure returned by compose
passes the
result of g
as f
’s first argument
anyway, while the closure returned by mcompose
treats
it as a tuple of arguments to pass along. Any extra arguments provided
to either compose
or mcompose
are
treated as extra arguments for f
(there is no
standard functional behavior to follow here):
compose(f, g, x)(y)  = f(g(y), x)
mcompose(f, g, x)(y) = f(*g(y), x)
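For instance, the lowercase-and-strip function mentioned earlier comes out as a one-liner (compose repeated here so the snippet stands alone):

```python
def compose(f, g, *args_for_f, **kwargs_for_f):
    # g runs first on the call-time arguments, then f on g's result
    def fg(*args_for_g, **kwargs_for_g):
        return f(g(*args_for_g, **kwargs_for_g), *args_for_f, **kwargs_for_f)
    return fg

# strip(lower(s)): order is irrelevant here, but generally it matters
clean = compose(str.strip, str.lower)
```

Here clean('  Hello World  ') returns 'hello world'.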
As in currying (see Recipe 16.4), this recipe’s functions are for constructing functions from other functions. Your goal in so doing should be clarity, since no efficiency is gained by using these functional forms.
Here’s a quick example for interactive use:
parts = compose(' '.join, dir)
When called on a module object, the callable we just bound to
name parts
gives you an easy-to-view string
that lists the module’s contents.
Recipe 16.4 for an example of currying (i.e., associating parameters with partially evaluated functions).
Credit: Jürgen Hermann, Mike Brown
You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.
tokenize.generate_tokens
does
most of the work. We just need to loop over all tokens it finds, to
output them with appropriate colorization:
""" MoinMoin - Python Source Parser """
import cgi, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)
_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2
_colors = {
    token.NUMBER:     '#0080C0',
    token.OP:         '#0000C0',
    token.STRING:     '#004080',
    tokenize.COMMENT: '#008000',
    token.NAME:       '#000000',
    token.ERRORTOKEN: '#FF8080',
    _KEYWORD:         '#C00000',
    _TEXT:            '#000000',
}

class Parser(object):
    """ Send colorized Python source HTML to output file (normally stdout).
    """
    def __init__(self, raw, out=sys.stdout):
        """ Store the source text. """
        self.raw = raw.expandtabs( ).strip( )
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while True:
            pos = self.raw.find('\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))
        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida, Courier New">')
        try:
            for tok in tokenize.generate_tokens(text.readline):
                # unpack the components of each token
                toktype, toktext, (srow, scol), (erow, ecol), line = tok
                if False:  # You may enable this for debugging purposes only
                    print "type", toktype, token.tok_name[toktype],
                    print "text", toktext,
                    print "start", srow, scol, "end", erow, ecol, "<br>"
                # Calculate new positions
                oldpos = self.pos
                newpos = self.lines[srow] + scol
                self.pos = newpos + len(toktext)
                # Handle newlines
                if toktype in (token.NEWLINE, tokenize.NL):
                    self.out.write('\n')
                    continue
                # Send the original whitespace, if needed
                if newpos > oldpos:
                    self.out.write(self.raw[oldpos:newpos])
                # Skip indenting tokens, since they're whitespace-only
                if toktype in (token.INDENT, token.DEDENT):
                    self.pos = newpos
                    continue
                # Map token type to a color group
                if token.LPAR <= toktype <= token.OP:
                    toktype = token.OP
                elif toktype == token.NAME and keyword.iskeyword(toktext):
                    toktype = _KEYWORD
                color = _colors.get(toktype, _colors[_TEXT])
                style = ''
                if toktype == token.ERRORTOKEN:
                    style = ' style="border: solid 1.5pt #FF0000;"'
                # Send text
                self.out.write('<font color="%s"%s>' % (color, style))
                self.out.write(cgi.escape(toktext))
                self.out.write('</font>')
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

if __name__ == "__main__":
    print "Formatting..."
    # Open own source
    source = open('python.py').read( )
    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format( )
    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html")
This code is part of MoinMoin (see http://moin.sourceforge.net/) and shows how to
use the built-in keyword
, token
, and tokenize
modules to scan Python source code
and re-emit it with appropriate color markup but no changes to its
original formatting (“no changes” is the hard part!).
The Parser
class’ constructor saves the
multiline string that is the Python source to colorize, and the file
object, which is open for writing, where you want to output the
colorized results. Then, the format
method prepares a
self.lines
list that holds the offset (i.e., the
index into the source string, self.raw
) of each
line’s start.
format then loops over the tokens yielded by the generator tokenize.generate_tokens, unpacking each
token tuple into items specifying the token type and starting and
ending positions in the source (each expressed as line number and
offset within the line). The body of the loop reconstructs the exact
position within the original source code string self.raw
, so it can emit exactly the same
whitespace that was present in the original source. It then picks a
color code from the _colors
dictionary (which uses
HTML color coding), with help from the keyword
standard module to determine whether
a NAME
token is actually a Python
keyword (to be output in a different color than that used for ordinary
identifiers).
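The same scanning machinery survives into modern Python: in Python 3, tokenize.generate_tokens still yields named five-item tuples, and keyword.iskeyword still distinguishes keywords from plain names. Here is a minimal Python 3 sketch (the function name classify is ours) that performs just the token-classification step of the recipe, without the HTML emission:

```python
import io
import keyword
import token
import tokenize

def classify(source):
    """Return (kind, text) pairs for each token of a Python source string,
    mapping NAME tokens that are keywords to the 'keyword' kind."""
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and keyword.iskeyword(tok.string):
            kind = 'keyword'
        else:
            # e.g. 'name', 'op', 'number', 'comment', 'newline'
            kind = token.tok_name[tok.type].lower()
        pairs.append((kind, tok.string))
    return pairs

pairs = classify("if x:\n    pass  # done\n")
```

Hooking an HTML (or terminal-color) emitter onto these (kind, text) pairs recovers the recipe's behavior; the hard part, as noted above, remains reproducing the original whitespace exactly.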
The test code at the bottom of the module formats the module
itself and launches a browser with the result, using the standard
Python library module webbrowser
to
enable you to see and enjoy the result in your favorite
browser.
If you put this recipe’s code into a module, you can then import
the module and reuse its functionality in CGI scripts (using the
PATH_TRANSLATED
CGI environment
variable to know what file to colorize), command-line tools (taking
filenames as arguments), filters that colorize anything they get from
standard input, and so on. See http://skew.org/~mike/colorize.py for versions
that support several of these various possibilities.
With small changes, it’s also easy to turn this recipe into an Apache handler, so your Apache web site can serve colorized .py files. Specifically, if you set up this script as a handler in Apache, then the file is served up as colorized HTML whenever a visitor to the site requests a .py file.
For the purpose of using this recipe as an Apache handler, you need to save the script as colorize.cgi (not .py, lest it confuses Apache), and add, to your .htaccess or httpd.conf Apache configuration files, the following lines:
AddHandler application/x-python .py
Action application/x-python /full/virtual/path/to/colorize.cgi
Also, make sure you have Apache's mod_actions module (which supplies the Action directive) enabled in your httpd.conf Apache configuration file.
Documentation for the webbrowser
, token
, tokenize
, and keyword
modules in the Library
Reference and Python in a Nutshell;
the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer,
as part of MoinMoin (http://moin.sourceforge.net), and, in a
somewhat different variant, also at http://skew.org/~mike/colorize.py; the Apache
web server is available and documented at http://httpd.apache.org.
Credit: Peter Cogolo
You need to tokenize an input language whose tokens are almost the same as Python’s, with a few exceptions that need token merging and splitting.
Standard library module tokenize
is very handy; we need to wrap it
with a generator to do the post-processing for a little splitting and
merging of tokens. The merging requires the ability to “peek ahead” in
an iterator. We can get that ability by wrapping any iterator into a
small dedicated iterator class:
class peek_ahead(object):
    sentinel = object()
    def __init__(self, it):
        self._nit = iter(it).next
        self.preview = None
        self._step()
    def __iter__(self):
        return self
    def next(self):
        result = self._step()
        if result is self.sentinel:
            raise StopIteration
        else:
            return result
    def _step(self):
        result = self.preview
        try:
            self.preview = self._nit()
        except StopIteration:
            self.preview = self.sentinel
        return result
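For readers on Python 3, the same wrapper needs only the renamed iterator-protocol method (__next__ instead of next); a sketch, with the class respelled PeekAhead to keep it distinct from the 2.x original:

```python
class PeekAhead:
    """One-item look-ahead iterator wrapper (a Python 3 respelling of the
    recipe's peek_ahead): .preview always holds the upcoming item, or the
    class-level sentinel once the underlying iterator is exhausted."""
    sentinel = object()
    def __init__(self, it):
        self._nit = iter(it).__next__
        self.preview = None
        self._step()          # prime .preview with the first item
    def __iter__(self):
        return self
    def __next__(self):
        result = self._step()
        if result is self.sentinel:
            raise StopIteration
        return result
    def _step(self):
        # advance: return the current preview, fetch the next one
        result = self.preview
        try:
            self.preview = self._nit()
        except StopIteration:
            self.preview = self.sentinel
        return result

it = PeekAhead('abc')
```

After construction, it.preview is 'a'; calling next(it) returns 'a' and moves the preview to 'b', and so on.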
Armed with this tool, we can easily split and merge tokens. Say,
for example, by the rules of the language we’re lexing, that we must
consider each of ':=
' and ':+
' to be a single token, but a
floating-point token that is a '.
'
with digits on both sides, such as '31.17
', must be given as a sequence of three
tokens, '31
', '.
', '17
'
in this case. Here’s how (using Python 2.4 code with comments on how
to change it if you’re stuck with version 2.3):
import tokenize, cStringIO
# in 2.3, also do 'from sets import Set as set'
mergers = {':' : set('=+'), }
def tokens_of(x):
    it = peek_ahead(toktuple[1]
        for toktuple in tokenize.generate_tokens(cStringIO.StringIO(x).readline)
        )   # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            # merge with next token, as required
            yield tok + it.next()
        elif tok[:1].isdigit() and '.' in tok:
            # split if digits on BOTH sides of the '.'
            before, after = tok.split('.', 1)
            if after:
                # both sides -> yield as 3 separate tokens
                yield before
                yield '.'
                yield after
            else:
                # nope -> yield as one token
                yield tok
        else:
            # not a merge or split case, just yield the token
            yield tok
Here’s an example of use of this recipe’s code:
>>> x = 'p{z:=23, w:+7}: m :+ 23.4'
>>> print ' / '.join(tokens_of(x))
p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /
In this recipe, I yield tokens only as substrings of the string
I’m lexing, rather than the whole tuple
yielded by tokenize.generate_tokens
, including such
items as token position within the overall string (by line and
column). If your needs are more sophisticated than mine, you should
simply peek_ahead
on whole token tuples (while I’m
simplifying things by picking up just the substring, item 1, out of
each token tuple, by passing to peek_ahead
a
generator expression), and compute start and end positions
appropriately when splitting or merging. For example, if you’re
merging two adjacent tokens, the overall token has the same start
position as the first, and the same end position as the second, of the
two tokens you’re merging.
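The merging-and-splitting logic itself does not depend on tokenize at all. The following self-contained Python 3 sketch (the function name merge_split is ours, and the token stream is supplied by hand rather than by tokenize) shows the same control flow, holding the one-token look-ahead in a local variable:

```python
# which follower tokens each token may merge with (assumption: same
# ':=' / ':+' rules as in the recipe's hypothetical input language)
MERGERS = {':': {'=', '+'}}

def merge_split(tokens, mergers=MERGERS):
    """Re-yield token substrings, merging ':'+'='/'+' pairs into one token
    and splitting floats with digits on both sides of the '.' into three."""
    it = iter(tokens)
    pending = next(it, None)          # one-token look-ahead
    while pending is not None:
        tok = pending
        pending = next(it, None)
        if pending is not None and pending in mergers.get(tok, ()):
            yield tok + pending       # merge with the next token
            pending = next(it, None)
        elif tok[:1].isdigit() and '.' in tok:
            before, after = tok.split('.', 1)
            if after:                 # digits on both sides: yield 3 tokens
                yield before
                yield '.'
                yield after
            else:
                yield tok
        else:
            yield tok

out = list(merge_split(['z', ':', '=', '23', ',', 'w', ':', '+', '23.4']))
```

The hand-supplied stream above mimics what tokenize would yield for 'z:=23, w:+23.4'; out comes back with ':=' and ':+' merged and '23.4' split in three.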
The peek_ahead
iterator wrapper class can often
be useful in many kinds of lexing and parsing tasks, exactly because
such tasks are well suited to operating on streams (which are well
represented by iterators) but often require a level of peek-ahead
and/or push-back ability. You can often get by with just one level; if
you need more than one level, consider having your wrapper hold a
container of peeked-ahead or pushed-back tokens. Python 2.4’s collections.deque
container implements a
double-ended queue, which is particularly well suited for such tasks.
For a more powerful look-ahead iterator wrapper, see Recipe 19.18.
Library Reference and Python
in a Nutshell sections on the Python Standard Library
modules tokenize
and cStringIO
; Recipe 19.18 for a more
powerful look-ahead iterator wrapper.
Credit: Peter Cogolo
You need to check whether a certain string has balanced parentheses, but regular expressions are not powerful enough for this task.
We want a “true” parser to check a string for balanced parentheses, since parsing theory proves that a regular expression is not sufficient. Choosing one out of the many Python parser generators, we’ll use David Beazley’s classic but evergreen PLY:
# define token names, and a regular expression per each token
tokens = 'OPEN_PAREN', 'CLOS_PAREN', 'OTHR_CHARS'
t_OPEN_PAREN = r'\('
t_CLOS_PAREN = r'\)'
t_OTHR_CHARS = r'[^()]+'    # RE meaning: one or more non-parentheses
def t_error(t):
    t.skip(1)
# make the lexer (AKA tokenizer)
import lex
lexer = lex.lex(optimize=1)
# define syntax action-functions, with syntax rules in docstrings
def p_balanced(p):
    ''' balanced : balanced OPEN_PAREN balanced CLOS_PAREN balanced
                 | OTHR_CHARS
                 |
    '''
    if len(p) == 1:
        p[0] = ''
    elif len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = p[1] + p[2] + p[3] + p[4] + p[5]
def p_error(p):
    pass
# make the parser (AKA scanner)
import yacc
parser = yacc.yacc()
def has_balanced_parentheses(s):
    if not s:
        return True
    result = parser.parse(s, lexer=lexer)
    return s == result
Here’s an example of use of this recipe’s code:
>>> s = 'ba(be, bi(bo, bu))'
>>> print s, has_balanced_parentheses(s)
ba(be, bi(bo, bu)) True
>>> s = 'ba(be, bi(bo), bu))'
>>> print s, has_balanced_parentheses(s)
ba(be, bi(bo), bu)) False
The first string has balanced parentheses, but the second one has an extra closed parenthesis; therefore, its parentheses are not balanced.
“How do I check a string for balanced parentheses?” is a frequently asked question about regular expressions. Programmers without a computer science background are often surprised to hear that regular expressions just aren’t powerful enough for this apparently simple task, and that a more complete form of grammar is required. (Perl’s regular expressions, augmented with the kitchen sink of arbitrary embedded expressions, do suffice—which just proves they aren’t anywhere near “regular” expressions any more!)
For this very simplified parsing problem we’re presenting, any
real parser is overkill—just loop over the string’s characters,
keeping a running count of the number of open and yet unclosed
parentheses encountered at this point, and return False
if the running count ever goes
negative or doesn’t go back down to exactly 0 at the end:
def has_bal_par(s):
    op = 0
    for c in s:
        if c == '(':
            op += 1
        elif c == ')':
            if op == 0:
                return False
            op -= 1
    return op == 0
However, using a parser when you need to parse is still a better
idea, in general, than hacking up special-purpose code such as this
has_bal_par
function. As soon as the problem gets
extended a bit (and problems invariably do grow,
in real life, in often unpredictable directions), a real parser can
grow gracefully and proportionally with the problem, while ad hoc code
often must be thrown away and completely rewritten.
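As a small taste of how the ad hoc approach has to grow when the problem does, here is a sketch of ours (not part of the recipe) that extends the single counter to a stack, so it can check several bracket kinds at once:

```python
def has_balanced_brackets(s, pairs={'(': ')', '[': ']', '{': '}'}):
    """Stack-based generalization of the recipe's single-counter check:
    each opener pushes its expected closer; each closer must match the
    top of the stack."""
    closers = set(pairs.values())
    stack = []
    for c in s:
        if c in pairs:
            stack.append(pairs[c])    # remember which closer we now expect
        elif c in closers:
            if not stack or stack.pop() != c:
                return False          # wrong closer, or closer with no opener
    return not stack                  # every opener must have been closed
```

Even this modest extension already forces a change of data structure (counter to stack); a grammar-based solution would instead grow by adding productions.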
All over the web, you can find oodles of Python packages that are suitable for lexing and parsing tasks. My favorite, out of all of them, is still good old PLY, David Beazley’s Python Lex-Yacc, which reproduces the familiar structure of the Unix commands lex and yacc while taking advantage of Python’s extra power when compared to the C language that those Unix commands support.
You can find PLY at http://systems.cs.uchicago.edu/ply/. PLY is a pure Python package: download it (as a .tgz compressed archive file), decompress and unarchive it (all reasonable archiving tools now support this subtask on all platforms), open a command shell, cd into the directory into which you unarchived PLY, and run the usual python setup.py install, with the proper privileges to be able to write into your Python installation’s site-packages directory (which privileges those are depends on how you installed Python, and on what platform you’re running). Briefly, install it just as you would install any other pure Python package.
As you can see from this recipe, PLY is quite easy to use, if
you know even just the fundamentals of lexing and parsing. First, you
define your grammar’s tokens—make a tuple or
list of all their names (conventionally uppercase) bound to name
tokens
at your module’s top level,
define for each token a regular expression bound to name t_
token_name
(again at the module’s top level), import lex
, and call lex.lex
to build your tokenizer (lexer).
Then, define your grammar’s action functions (each of them carries the
relevant syntax rule—production—in its
docstring in BNF, Backus-Naur Form), import yacc
, and call yacc.yacc
to build your parser (scanner). To
parse any string, call the parse
method of your parser with the string as an argument.
All the action is in your grammar’s action functions, as their
name implies. Each action function receives as its single argument
p
a list of production elements corresponding to the
production that has been matched to invoke that function; the action
function’s job is to put into p[0]
whatever you need as “the result” of that syntax rule getting matched.
In this recipe, we use as results the very strings we have been matching, so that function has_balanced_parentheses just needs to check whether the whole string is matched by the parse operation.
When you run this script the first time, you will see a warning about a shift/reduce conflict. Don’t worry: as any old hand at yacc can tell you, that’s the yacc equivalent of a rite of passage. If you want to understand that message in depth, and maybe (if you’re an ambitious person) even do something about it, open with your favorite browser the doc/ply.html file in the directory in which you unpacked PLY. That file contains a rather thorough documentation of PLY. As that file suggests, continue by studying the contents of the examples directory and then read a textbook about compilers—I suggest Dick Grune and Ceriel J.H. Jacobs, “Parsing Techniques, a Practical Guide.” The first edition, at the time of this writing, is freely available for download as a PDF file from http://www.cs.vu.nl/~dick/PTAPG.html, and a second edition should be available in technical bookstores around the middle of 2005.
PLY web page at http://systems.cs.uchicago.edu/ply/; Dick Grune and Ceriel J.H. Jacobs, “Parsing Techniques, a Practical Guide,” a PDF, downloadable from http://www.cs.vu.nl/~dick/PTAPG.html.
Credit: Will Ware
Python’s introspection facilities let you code a class that
implements a version of enum
, even
though Python, as a language, does not support the enum
construct:
class EnumException(Exception):
    pass

class Enumeration(object):
    def __init__(self, name, enumList, valuesAreUnique=True):
        self.__doc__ = name
        self.lookup = lookup = { }
        self.reverseLookup = reverseLookup = { }
        i = 0
        for x in enumList:
            if type(x) is tuple:
                try:
                    x, i = x
                except ValueError:
                    raise EnumException, "tuple doesn't have 2 items: %r" % (x,)
            if type(x) is not str:
                raise EnumException, "enum name is not a string: %r" % (x,)
            if type(i) is not int:
                raise EnumException, "enum value is not an integer: %r" % (i,)
            if x in lookup:
                raise EnumException, "enum name is not unique: %r" % (x,)
            if valuesAreUnique and i in reverseLookup:
                raise EnumException, "enum value %r not unique for %r" % (i, x)
            lookup[x] = i
            reverseLookup[i] = x
            i = i + 1
    def __getattr__(self, attr):
        try:
            return self.lookup[attr]
        except KeyError:
            raise AttributeError, attr
    def whatis(self, value):
        return self.reverseLookup[value]
In the C language, enum
lets
you declare several named constants, typically with unique values
(although you can also explicitly arrange for a value to be duplicated
under two different names), without necessarily specifying the actual
values (except when you want it to). Despite the similarity in naming,
C’s enum
and this recipe’s
Enumeration
class have little to do with the Python
built-in enumerate
generator, which
is used to loop on (
index
, item
)
pairs given an iterable—an
entirely different issue!
Python has an accepted idiom that’s fine for small numbers of constants:
A, B, C, D = range(4)
However, this idiom doesn’t scale well to large numbers of
constants and doesn’t allow you to specify values for some constants
while leaving others to be determined automatically. This recipe
provides for all these niceties and, optionally, also checks that all
values (both the ones explicitly specified and the ones automatically
determined) are unique. Enum values are attributes of an
Enumeration
class instance
(Volkswagen.BEETLE
,
Volkswagen.PASSAT
, etc.). A further feature, missing
in C but really quite useful, is the ability to go from the value to
the corresponding name inside the enumeration (the name you get can be
somewhat arbitrary for those enumerations in which you don’t constrain
values to be unique).
This recipe’s Enumeration
class has an
initializer that accepts a string argument to specify the
enumeration’s name and a sequence argument to specify the names of all
values in the enumeration. Each item of the sequence argument can be a
string (to specify that the value named is one more than the last
value used) or else a tuple with two items (the string that is the
value’s name, then the value itself, which must be an integer). The
code in this recipe relies heavily on strict type checking to
determine which case applies, but the recipe’s essence would not
change by much if the checking was performed in a more lenient way
(e.g., with the isinstance
built-in
function).
Each Enumeration
instance has two dict
attributes:
self.lookup to map names to values and self.reverseLookup to map values back to the corresponding names. The special method __getattr__ lets you use names with attribute syntax
(e.x
is mapped to e.lookup['x']
), and the
whatis
method allows reverse lookups (i.e., find a
name given a value) with similar ease.
Here’s an example of how you can use this
Enumeration
class:
if __name__ == '__main__':
    import pprint
    Volkswagen = Enumeration("Volkswagen",
        ("JETTA", "RABBIT", "BEETLE", ("THING", 400), "PASSAT", "GOLF",
         ("CABRIO", 700), "EURO_VAN", "CLASSIC_BEETLE", "CLASSIC_VAN"
        ))
    Insect = Enumeration("Insect",
        ("ANT", "APHID", "BEE", "BEETLE", "BUTTERFLY", "MOTH", "HOUSEFLY",
         "WASP", "CICADA", "GRASSHOPPER", "COCKROACH", "DRAGONFLY"
        ))
    def whatkind(value, enum):
        return enum.__doc__ + "." + enum.whatis(value)
    class ThingWithKind(object):
        def __init__(self, kind):
            self.kind = kind
    car = ThingWithKind(Volkswagen.BEETLE)
    print whatkind(car.kind, Volkswagen)    # emits Volkswagen.BEETLE
    bug = ThingWithKind(Insect.BEETLE)
    print whatkind(bug.kind, Insect)        # emits Insect.BEETLE
    print car.__dict__                      # emits {'kind': 2}
    print bug.__dict__                      # emits {'kind': 3}
    pprint.pprint(Volkswagen.__dict__)
    pprint.pprint(Insect.__dict__)
    # emits dozens of lines showing off lookup and reverseLookup dictionaries
Note that the attributes of car
and
bug
don’t include any of the enum
machinery because that machinery is
held as class attributes, not as instance attributes. This means you
can generate thousands of car
and
bug
objects with reckless abandon, never
worrying about wasting time or memory on redundant copies of the
enum
stuff.
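On Python 3, the same idea needs only cosmetic changes (function-call raise syntax, and here the more lenient isinstance checks the recipe itself suggests as an alternative); a compact sketch of ours:

```python
class Enumeration:
    """Python 3 sketch of the recipe's Enumeration, with isinstance-based
    (lenient) checks instead of strict type() tests."""
    def __init__(self, name, enum_list, values_are_unique=True):
        self.__doc__ = name
        self.lookup = {}
        self.reverseLookup = {}
        i = 0
        for x in enum_list:
            if isinstance(x, tuple):
                x, i = x              # explicit (name, value) pair
            if not isinstance(x, str):
                raise ValueError('enum name is not a string: %r' % (x,))
            if x in self.lookup:
                raise ValueError('enum name is not unique: %r' % (x,))
            if values_are_unique and i in self.reverseLookup:
                raise ValueError('enum value %r not unique for %r' % (i, x))
            self.lookup[x] = i
            self.reverseLookup[i] = x
            i += 1                    # next value is one more than the last
    def __getattr__(self, attr):
        try:
            return self.lookup[attr]
        except KeyError:
            raise AttributeError(attr)
    def whatis(self, value):
        return self.reverseLookup[value]

vw = Enumeration("Volkswagen", ("JETTA", "RABBIT", ("THING", 400), "PASSAT"))
```

With this, vw.JETTA is 0, vw.THING is 400, and vw.PASSAT continues from there at 401, just as in the recipe's 2.x version.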
Recipe 6.2 shows
how to define constants in Python; documentation on the special method
__getattr__
in the
Language Reference and Python in a
Nutshell.
Credit: Chris Perkins
You want to refer, from inside a list comprehension, to the same list object you’re building. However, the object being built by the list comprehension doesn’t have a name while you’re building it.
Internally, the Python interpreter does create a “secret” name
that exists only while a list comprehension is being built. In Python
2.3, that name is usually '_[1]
'
and refers to the bound method append
of the list object we’re building. We
can use this secret name as a back door to refer to the list object as
it gets built. For example, say we want to build a copy of a list but
without duplicates:
>>> L = [1, 2, 2, 3, 3, 3]
>>> [x for x in L if x not in locals()['_[1]'].__self__]
[1, 2, 3]
Python 2.4 uses the same name to indicate the list object being built, rather than the bound-method access. In the case of nested list
comprehensions, inner ones are named '_[2]
', '_[3]
', and so on, as nesting goes deeper.
Clearly, all of these considerations are best wrapped up into a
function:
import inspect
import sys
version_23 = sys.version_info < (2, 4)
def this_list():
    d = inspect.currentframe(1).f_locals
    nestlevel = 1
    while '_[%d]' % nestlevel in d:
        nestlevel += 1
    result = d['_[%d]' % (nestlevel - 1)]
    if version_23:
        return result.__self__
    else:
        return result
Using this function, we can make the preceding snippet more readable, as well as making it work properly in Python 2.4 as well as in version 2.3:
>>> [x for x in L if x not in this_list()]
[1, 2, 3]
List comprehensions may look a little like magic, but the
bytecode that Python generates for them is in fact quite mundane:
create an empty list, give the empty list’s bound method append a temporary name in the locals dictionary, append items one at a time, and then delete the name. All
of this happens, conceptually, between the open square bracket (
[
) and the close square bracket (
]), which enclose the list comprehension.
The temporary name that Python 2.3 assigns to the bound append
method is '_[1]
' (or '_[2]
', etc., for nested list
comprehensions). This name is deliberately chosen (to avoid accidental
clashes) to not be a syntactically valid Python
identifier, so we cannot refer to the bound method directly, by name.
However, we can access it as locals( )['_[1]']
. Once we have a reference
to the bound method object, we just use the bound method’s _ _self_ _
attribute to get at the list
object itself. In Python 2.4, the same name refers directly to the
list object, rather than to its bound method, so we skip the last
step.
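You can watch this machinery at work in any CPython version by disassembling a function that contains a list comprehension. In modern CPython, the append shows up as a dedicated LIST_APPEND opcode rather than a temporarily named bound method, which is one reason this recipe's trick does not carry over to Python 3. A quick sketch:

```python
import dis
import io

def make_list():
    # an ordinary list comprehension, used only as disassembly fodder
    return [x * 2 for x in range(3)]

# capture the disassembly listing as a string instead of printing it
buf = io.StringIO()
dis.dis(make_list, file=buf)
listing = buf.getvalue()
```

Searching the listing for LIST_APPEND shows the opcode that modern CPython uses where Python 2.3 used the secretly named bound append method.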
Having a reference to the list object enables us to do all sorts
of neat party tricks, such as performing if
tests that involve looking at the items
that have already been added to the list, or even modifying or
deleting them. These capabilities are just what the doctor ordered for
finding primes in a “one-liner”, for example: for each odd number, we
need to test whether it is divisible by any prime number less than or
equal to the square root of the number being tested. Since we already
have all the smaller primes stored and, with our new parlor trick,
have access to them, this test is a breeze and requires no auxiliary
storage:
import itertools
def primes_less_than(N):
    return [p for p in itertools.chain([2], xrange(3, N, 2))
            if 0 not in itertools.imap(
                lambda x: p % x,
                itertools.takewhile(
                    lambda v: v*v <= p, this_list()))]
The list comprehension that’s the whole body of this function
primes_less_than
, while long enough not to fit into a
single physical line, is all in a single logical
line (indeed, it must be, since any list comprehension is a single
expression), and therefore qualifies as a “one-liner” if you squint in
just the right way.
This simple prime-finding algorithm is nowhere near as fast as
the Sieve of Eratosthenes shown in Recipe 18.10, but the
ability to fit the entire algorithm inside a single expression is
nevertheless kind of neat. Part of its neatness comes from the
just-in-time evaluation that the functions from standard library
module itertools
perform so
nicely.
Alas, this neat trick definitely cannot be
recommended for production code. While it works in Python 2.3 and 2.4,
it could easily break in future releases, since it depends on
undocumented internals; for the same reason, it’s unlikely to work
properly on other implementations of the Python language, such as
Jython or IronPython. So, I suggest you use it to impress friends, but
for any real work, stick to clearer, faster, and solid good old
for
loops!
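For contrast, here is what the same algorithm looks like as one of those solid good old loops, in a Python 3 sketch of ours; with the result list bound to an ordinary name, no introspection trick is needed at all:

```python
def primes_less_than_loop(N):
    """Explicit-loop equivalent of the one-liner primes_less_than: keep the
    primes found so far in a plain, named list and test each candidate
    against the primes no larger than its square root."""
    primes = []
    for p in [2] + list(range(3, N, 2)):
        # p is prime if no smaller prime q with q*q <= p divides it
        if all(p % q for q in primes if q * q <= p):
            primes.append(p)
    return primes
```

The filtering `if q * q <= p` inside the generator expression plays the role that itertools.takewhile plays in the one-liner (at the cost of scanning the whole primes list rather than stopping early).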
Documentation for bound methods, list
s’ append
method, and the itertools
module in the Library
Reference and Python in a
Nutshell.
Credit: Alexander Semenov
You often use py2exe
to build Windows .exe files from
Python scripts, but you don’t want to bother writing a setup.py build script for each and every
such script.
distutils
is a package in the standard Python library, ready to
be imported from your Python code. py2exe
is a third-party extension to
distutils
for the specific task of
generating Windows executables from Python code: you must download and
install py2exe
separately, but once
installed, it cooperates smoothly with the standard distutils
. Thanks to these features, you can
easily write Python scripts to automate distutils
tasks (including py2exe
tasks). For example:
from distutils.core import setup
import sys, os, py2exe

# the key trick with our arguments and Python's sys.path
name = sys.argv[1]
sys.argv[1] = 'py2exe'
sys.path.append(os.path.dirname(os.path.abspath(name)))
setup(name=name[:-3], scripts=[name])
Save this as makexe.py in
the Tools\Scripts folder of your
Python installation. (You should always add this folder to your
Windows PATH
because it contains
many useful tools.) Now, from a Windows command prompt, you’re able to
cd
to a directory where you have
placed a script (say C:\MyDir),
and there run, say:
C:\MyDir> makexe.py myscript.py
and (assuming that you have a myscript.py script there, and .py among your Windows executable
extensions, with association to the Python interpreter) py2exe
prepares all the files you need for
distributing your masterpiece (as a Windows executable and supporting
DLLs), neatly arranged in folder C:\MyDir\dist\myscript.
The distutils
package is part
of the Python Standard Library. It helps you prepare your Python
modules and extensions for distribution, as well as letting you
install such packages as distributed by others. py2exe
is a freely downloadable third-party
extension that works on top of distutils
to help you build a Windows
.exe file (and a set of
supporting DLL files) from a Python-coded program, so that you can
distribute your program in executable form to other Windows PCs that
may not have Python installed; see http://starship.python.net/crew/theller/py2exe/,
both to download py2exe
and for
detailed documentation of this useful tool.
Following the details given in the distutils
(and py2exe
) documentation, the canonical way to
use distutils
(including py2exe
) is to write a script, conventionally
always named setup.py, to perform
all kinds of distutils
tasks on
your package. Normally, you write a setup.py for each package you distribute,
placing it in the top directory of the package (known as the
distribution root in distutils
terminology).
However, there is nothing mandatory about the convention of
writing a setup.py script per
package. distutils
and py2exe
, after all, are written as modules to
be imported from Python. So, you can, if you so choose, use all the
power of Python to code scripts that help you perform distutils
and py2exe
tasks in whatever ways you find most
convenient.
This recipe shows how I eliminate the need to write a separate
setup.py script for each Python
script that I convert to an executable with py2exe
, and related issues such as the need
to keep such scripts in dedicated “distribution root” directories. I
suggest you name this recipe’s script makexe.py, but any name will do, as long as
you avoid naming it py2exe.py (a
natural enough temptation). (Naming it py2exe.py would break the script because
the script must import py2exe
, and
if you named the script py2exe.py
it would “import itself” instead!)
Place this script on any directory on your Windows PATH
where you normally keep executable
Python scripts. I suggest you use the Tools\Scripts folder of the Python
distribution, a folder that contains several other useful scripts
you’ll want to have handy (have a look in that folder—it’s worth your
time). I’m not going to delve into the details of how to set and
examine your Windows PATH
, open a
command prompt, make your Python scripts executable, and so on. Such
system administration details differ slightly on each version of
Windows, and you’ll need to master them for any Windows version on
which you want to perform significant programming, anyway.
Once you have implemented this Solution, you’ll find that making your Python scripts into Windows executables has become so easy and so fast that soon you’ll be distributing your neat and useful programs to friends and acquaintances right and left. You won’t need to convince them to install the Python runtime files before they can install and run your programs, either! (Of course, in this way they will end up with what amounts to several copies of the runtime files, if they install several of your compiled programs—there is little you can do about that.)
The section “Distributing Python Modules” of the standard Python
documentation set is still incomplete but a good source of information
on the distutils
package;
Python in a Nutshell covers the essentials of
the distutils
package; py2exe
is at http://starship.python.net/crew/theller/py2exe/.
Credit: Joerg Raedler
You have a Python application composed of a main script and some additional modules. You want to bind the script and modules into one executable file, so that no installation procedure is necessary.
Prepare the following mixed sh/Python script and save it as file zipheader.unix:
#!/bin/sh
PYTHON=$(which python 2>/dev/null)
if [ ! -x "$PYTHON" ] ; then
    echo "python executable not found - cannot continue!"
    exit 1
fi
exec $PYTHON - $0 $@ << END_OF_PYTHON_CODE
import sys
version = sys.version_info[:2]
if version < (2, 3):
    print 'Sorry, need Python 2.3 or better; %s.%s is too old!' % version
    sys.exit(1)
sys.path.insert(0, sys.argv[1])
del sys.argv[0:2]
import main
main.main()
END_OF_PYTHON_CODE
Make sure you have the Python bytecode files for the main script
of your application (file main.pyc, containing a function named
main
, which starts the application when called
without arguments) and any additional modules your application needs
(e.g., files spam.pyc and
eggs.pyc). Make a zip file out of
them all:
$ zip myapp.zip main.pyc spam.pyc eggs.pyc
(If you prefer, you can build the zip file with an auxiliary Python program, of course.) Next, concatenate the “header” and the zip file, and make the resulting file executable:
$ cat zipheader.unix myapp.zip > myapp
$ chmod +x myapp
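The auxiliary Python program mentioned above might look like this sketch (the function name and defaults are ours): it writes the header, appends the zip archive to the same file using zipfile's append mode (which starts a fresh archive after non-zip leading data), and sets the executable bits:

```python
import os
import stat
import zipfile

def build_selfrunning(header='zipheader.unix', out='myapp',
                      members=('main.pyc', 'spam.pyc', 'eggs.pyc')):
    """Concatenate the sh/Python header with a zip of the bytecode files,
    then mark the result executable (one-step version of cat + zip + chmod)."""
    with open(out, 'w+b') as f:
        with open(header, 'rb') as h:
            f.write(h.read())
        # mode 'a' on a file whose contents are not a zip archive simply
        # appends a new archive after the existing data -- exactly what
        # the cat command above achieves
        with zipfile.ZipFile(f, 'a') as z:
            for name in members:
                z.write(name)
    mode = os.stat(out).st_mode
    os.chmod(out, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Python's zip import machinery (like zipfile itself) tolerates the leading non-zip data, which is what makes the whole scheme work.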
That’s all! Your application is now contained in this executable file myapp. When myapp runs, the shell /bin/sh sets things up and replaces itself with the Python interpreter. The Python interpreter reopens the file as a zip file, skipping the “header” text, and finds all needed modules in the zip file itself.
On Windows machines, you would normally use py2exe for similar tasks, as shown previously in Recipe 16.11; on Mac OS X, you would normally use py2app (although this recipe works just as well on Mac OS X as it does on any other Unix).
This recipe is particularly useful for Linux and other Unix
variants that come with Python installed. By following the steps
outlined in this recipe’s Solution, you can distribute a Python
application as a single, self-contained standalone executable file,
which runs on any version of Unix, on any hardware platform—as long as
your Python application does not need any C-coded extension modules
beyond the ones that come with Python itself. When you do need more,
you can use Python’s own distutils
package to perform more complicated packaging tasks. But for many
simple Python applications and quite a few that aren’t all that
simple, this recipe can be very useful, since it results in a file
that can just be run as is, without needing any
kind of “installation” step!
The key idea of this recipe is to exploit Python’s ability to import modules from a zip file, while skipping leading text that may precede the zip file itself. Here, as leading text, we use a small shell script that turns itself into a Python script, and within the same file continues with the zip file from which everything gets imported. The concept of importing from a zip file is described in Recipe 2.9.
In the zip file, you may, if you wish, place Python source files (with extension .py), as well as compiled bytecode files (with extension .pyc); the latter option is often preferable because if you zip up source files, Python compiles them every time you run the application, slowing your application’s startup. On the other hand, if you zip up compiled bytecode files, your application may be unable to run with versions of Python that are newer than the one you used to prepare the bytecode files, since binary compatibility of bytecode files is not guaranteed across Python releases. The best approach may be to place both sources and bytecodes in the zip file.
You may also choose to zip up optimized
bytecode files (with extension .pyo)—if you do so, you need to add the
flag -O
right after the $PYTHON
in the shell script in this recipe’s
Solution. Execution speed doesn’t generally change much, but optimized
execution skips assert
statements,
which may be important to you. Also, if you prepare the .pyo files by running Python with option
-OO
, all docstrings are eliminated,
which may slightly reduce your application’s size on disk (although
docstrings tend to compress well, so that size advantage may be
minor).
If you need help in finding all the modules that you need to
place in the zip file, see the modulefinder
module in the Python Standard
Library. Unfortunately, no real documentation about it is available at
the time of this writing, but just running (in version 2.3) something
like:
$ python /usr/lib/python2.3/modulefinder.py main.py
should help (you may have to change the path to the modulefinder.py script, depending on your Python installation). With Python 2.4, you can just use the handy new -m switch:
$ python -m modulefinder main.py
Python 2.4’s -m switch lets
you run as the main script any module that’s on Python’s sys.path
—a very convenient little
feature!
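modulefinder is also usable directly as a library, which can be handy when automating the zip-building step from a script of your own; a minimal sketch (the helper name is ours):

```python
from modulefinder import ModuleFinder

def modules_used_by(script_path):
    """Return the sorted names of every module modulefinder sees the given
    script import (the script itself appears as '__main__')."""
    finder = ModuleFinder()
    finder.run_script(script_path)
    return sorted(finder.modules)
```

Feeding the result (minus the standard-library names you expect on the target machine) to your zip-building step gives you the list of files to bundle.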
Recipe 16.11;
Recipe 2.9; the
sources of modules modulefinder
and
zipimport
(which are not yet
documented in the Library Reference at the time
of writing).
[1] I’d even call this book the ultimate reference, were it not for the fact that Donald Knuth continues to promise that the fifth volume (current ETA, the year 2010) of his epoch-making The Art of Computer Programming will be about this very subject.