Credit: Raymond Hettinger
Lather, Rinse, Repeat
—Docs for my bottle of shampoo
After namespaces, iterators and generators emerged as the next “honking great ideas” in Python. Since their introduction in Python 2.2, they have come to pervade and unify the language. They encourage a loosely coupled programming style that is simple to write, easy to read, flexible, and extendable.
Simply put, the iterator protocol has two halves, a producer and a consumer. An iterable object says, “I know how to supply data one element at a time,” and the consumer says “please give me data one element at a time and say Stop when you’re done.”
The producer/consumer connection can appear in a number of guises. The simplest is where a function or constructor wraps around an iterable object. For example, sorted(set('simsalabim')) has the set constructor looping over the elements of the iterable string and a sorted function wrapping around the resulting iterable set object.
In addition to functions and constructors, regular Python statements can use the in operator to loop over iterable objects. for line in myfile: print line loops over lines of an iterable file object. Likewise, if token in sequence loops over elements of a sequence until it finds a match (or until it reaches the end with no matches).
Both guises of the consumer side of the iterator protocol use the protocol implicitly. In addition, an explicit form is more flexible but used less often. The iterator object is saved as a variable, it = iter(mystring). Then, the iterator's next method is called to retrieve a data element, elem = it.next( ). Such calls are usually wrapped in try/except statements to catch the StopIteration exception that the iterator raises when the data stream is exhausted.
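For illustration, here is a minimal sketch of that explicit form (mystring is just a sample value, and the loop simply prints each element):
mystring = 'simsalabim'
it = iter(mystring)
while True:
    try:
        elem = it.next( )
    except StopIteration:
        break                  # the iterator is exhausted, leave the loop
    print elem                 # process the data element in some way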
All of these approaches provide the full range of iterator benefits, including loose coupling and memory friendliness. The loose coupling is evident in the first example, where the independently created and maintained sorted function, set data type, and string objects were readily combined. The memory friendliness derives from the one-at-a-time structure of the iterator protocol. Programs using iterators are likely to be less resource intensive and more scalable than their list-based counterparts.
An object that wants to be iterable should implement an __iter__ method, which returns an iterator object. Ideally, the iterator should be a distinct object from the iterable, to make it possible to have multiple iterators over the same iterable container. There are exceptions to this general recommendation: for example, a sequential file object does not readily lend itself to multiple iterations; therefore, it is more appropriate in this case to have the file object be its own iterator rather than return a separate iterator object; given any file instance f, indeed, iter(f) is f.
Any iterator object must implement a next method and an __iter__ method. The next method should raise StopIteration when the iteration is complete. Care should be taken that subsequent calls to next continue to raise StopIteration (once stopped, it stays stopped). The __iter__ method of an iterator object should always return the iterator itself (__iter__ is idempotent on iterators). This simplifies client code by allowing it to treat iterators and iterables the same way (i.e., both return an iterator in response to the iter function).
To be useful, most iterators have a stored state that enables them to return a new data element on each call. The previously described responsibilities and the need to have a stored state both suggest that classes are the way to create iterable objects. That approach is obvious, explicit, and rarely used (only two recipes in this chapter make any use of classes).
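As a concrete, purely illustrative sketch of these responsibilities (the class names are invented for the example), here is a class-based iterable whose __iter__ returns a distinct iterator object:
class Evens(object):
    """ Iterable container of the first n even numbers (example only). """
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        # a new, independent iterator on each call
        return EvensIterator(self.n)

class EvensIterator(object):
    def __init__(self, n):
        self.n = n
        self.i = 0
    def __iter__(self):
        # an iterator's __iter__ returns the iterator itself
        return self
    def next(self):
        if self.i >= self.n:
            raise StopIteration    # once stopped, it stays stopped
        result = 2 * self.i
        self.i += 1
        return result

print list(Evens(4))               # emits: [0, 2, 4, 6]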
Instead of writing classes, two alternate approaches dominate. Starting with the observation that many functions and types both accept iterable inputs and return iterable outputs, an obvious approach is to link them together in a “pipes and filters” style to create new tools. For example, def uniq(seq): return sorted(set(seq)) is a way to create a new tool directly from existing functions and types. Like functional programming, the resulting code is terse, readable, trivial to debug, and often runs at the speed of compiled C code. The economy of this approach motivated the creation of an entire module of iterator building blocks, the itertools module. Indeed, many of the brilliant, effective recipes in this chapter make frequent use of itertools components.
If no combination of building blocks solves the problem, the next best approach is to write a generator. Recipe 19.1 shows how trivially easy it is to write a generator. By introducing a yield keyword, the responsibilities of creating an iterator are handled automatically. The iterator objects obtained by calling a generator are distinct, save their state, have an idempotent __iter__ method, and have a next method that raises StopIteration when complete and stays stopped if called again afterwards. Python internally takes care of all of these details. Because of generators' compelling simplicity, most of the recipes in this chapter make use of generators.
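A tiny, hypothetical example shows those automatically supplied behaviors at work:
def count_up_to(n):
    """ Yield 0, 1, ..., n-1 (a trivial example generator). """
    i = 0
    while i < n:
        yield i
        i += 1

it1 = count_up_to(3)
it2 = count_up_to(3)               # a distinct iterator with its own state
print iter(it1) is it1             # True: __iter__ returns the iterator itself
print it1.next( ), it1.next( )     # 0 1
print it2.next( )                  # 0 -- it2's state is independent of it1's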
Starting with version 2.4, Python continued its evolution toward using iterators everywhere by introducing generator expressions (genexps for short). Genexps can be likened to a memory-efficient, scalable form of list comprehensions. Simply by replacing brackets with parentheses, an expression will yield one element at a time rather than filling memory all at once. Used correctly (i.e., in a context where they are consumed immediately, one item at a time), genexps can offer remarkable clarity and economy: sum(x*x for x in xrange(10)) is a great way to express the sum of the squares of the first ten natural numbers.
Paradoxically, the simpler and more general an idea, the more likely that people will find extraordinary and unexpected ways of using it. Here is a brief sampling of the ways that iterators and generators have been pushed to their outer limits.
Observing that the yield keyword has the unique capability of stopping execution, saving state, and later resuming, it is not surprising that techniques have been discovered for using generators to simulate coroutines and continuations. The core idea is to implement each routine as a generator and to have a dispatch function launch the routines in orderly succession. Whenever a task switch is needed, the routines yield back to the dispatcher, which then launches or resumes the next routine by calling its next method. Small complications are involved for startup, termination, and data sharing, but they are each solvable without much ado and present fewer challenges than equivalent thread-based solutions. See Recipe 9.8 for an example.
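To make the dispatcher idea concrete, here is a minimal, hypothetical sketch (not the code of Recipe 9.8): each task is a generator that yields whenever it is willing to give up control, and a round-robin dispatcher keeps calling next on the tasks that are still alive.
def task(name, count):
    for i in xrange(count):
        print '%s: step %d' % (name, i)
        yield None                 # hand control back to the dispatcher

def dispatcher(tasks):
    tasks = list(tasks)
    while tasks:
        for t in tasks[:]:         # iterate over a copy so we can remove
            try:
                t.next( )          # run this task up to its next yield
            except StopIteration:
                tasks.remove(t)    # this task has terminated

dispatcher([task('A', 3), task('B', 2)])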
Observing that some tools can be both producers and consumers, it is natural to want to stack them together like pipes and filters. While that analogy can lead to useful decoupling, be aware that underlying models are different. Iterators do not run independently from start to finish; instead, an outermost layer is always in control, requesting data elements one at a time, so that nothing runs until the outer layer starts making requests.
When stacking tools together (as in the first example with sorted, set, and a string), the code takes on the appearance of a functional programming language. The resemblance is not shallow: iterators do fulfill some of the promise of lazy languages. So, it is natural to borrow some of the most successful techniques from those languages, such as Haskell and SML.
One such technique is to write innermost iterators to yield infinite streams and concentrate the control logic in an outermost driver function. For instance, in numerical programming, write a generator that yields successively better approximations to a desired result and call it from a function that stops whenever two successive approximations fall within a tolerance value. Separating the control logic from the calculation decouples the two, making them easier to write, test, and debug, and makes them more reusable in other contexts.
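As a hypothetical sketch of this division of labor (not one of this chapter's recipes), a Newton-style square-root generator can yield an unbounded stream of ever-better guesses, while a separate driver decides when two successive guesses are close enough:
def sqrt_guesses(x, guess=1.0):
    """ Unbounded generator of successive Newton approximations to sqrt(x). """
    while True:
        yield guess
        guess = (guess + x / guess) / 2.0

def within(iterable, tolerance=1e-9):
    """ Driver: return the first item that differs from its predecessor
        by less than tolerance. """
    it = iter(iterable)
    previous = it.next( )
    for current in it:
        if abs(current - previous) < tolerance:
            return current
        previous = current

print within(sqrt_guesses(2.0))    # emits roughly 1.41421356237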
Here are some instructive snippets. Consider each of them carefully, study how they work, and you'll be well on your way towards understanding how best to link iterators together to solve practical problems. Each of the following snippets is independent from the others:
result = dict(enumerate(myseq))

result = set(word for line in page for word in line.split( ))

def dotproduct(v1, v2):
    return sum(itertools.imap(operator.mul, v1, v2))

def dotproduct(v1, v2):
    return sum(x*y for x,y in itertools.izip(v1, v2))

randgen = itertools.starmap(random.random, itertools.repeat(( )))

randgen = iter(random.random, -1.0)
The idea for restartable iterators surfaces every so often and then drowns in quicksand. sys.stdin is a plain example of an iterable that cannot logically be restarted unless an entire session is saved in memory. A craving for restartability should be taken as a cue that a list might well be a more appropriate data structure.
Just because iterators cannot be restarted doesn't mean they cannot be abandoned in mid-stream. The lazy, just-in-time style of production is a key feature of iterators. Take advantage of it. That's why the for statement supports a break keyword, after all.
The core itertools and their derivatives (see the recipes in the itertools docs that are part of the Python Library Reference) all run at nearly the speed of compiled code. When Python 2.4 introduced a native set data type, I timed it against the pure-Python version, sets.py, and learned that some of the set logic (intersection, union, etc.) achieved only a two-to-one increase in speed. The reason was that sets.py used itertools, and itertools performance was exceptional. So, when performance is an issue, consider an itertools solution before turning to more labor-intensive optimizations or native language extensions.
Credit: Dinu Gherman, Paul Winkler, Stephen Levings
You need an arithmetic progression, like the built-in xrange but with float values (xrange works only on integers).
Although float arithmetic progressions are not available as built-ins, it’s easy to code a generator to supply them:
import itertools

def frange(start, end=None, inc=1.0):
    "An xrange-like generator which yields float values"
    # imitate range/xrange strange assignment of argument meanings
    if end is None:
        end = start + 0.0     # Ensure a float value for 'end'
        start = 0.0
    assert inc                # sanity check
    for i in itertools.count( ):
        next = start + i * inc
        if (inc>0.0 and next>=end) or (inc<0.0 and next<=end):
            break
        yield next
Sadly missing in the Python Standard Library, the generator in this recipe lets you use arithmetic progressions, similarly to the built-in xrange but with float values.
Many theoretical restrictions apply, but this generator is more useful in practice than in theory. People who work with floating-point numbers all the time tell many war stories about billion-dollar projects that failed because someone did not take into consideration the strange things that modern hardware can do, at times, when comparing floating-point numbers. But for pedestrian cases, simple approaches like this recipe generally work.
This observation by no means implies that you can afford to ignore the fundamentals of numerical analysis, if your programs need to do anything at all with floating-point numbers! For example, in this recipe, we rely on a single multiplication and one addition to compute each item, to avoid accumulating error by repeated additions. Precision would suffer in a potentially dangerous way if we “simplified” the first statement in the loop body to something like:
next += inc
as might appear very tempting, were it not for those numerical analysis considerations.
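If you are curious about the size of the effect, a quick, hypothetical check along the following lines shows the kind of drift that repeated additions accumulate (the exact figure depends on your platform's floating point):
inc, start = 0.1, 0.0
# one multiplication and one addition per item, as in the recipe:
direct = [start + i * inc for i in xrange(1000)]
# repeated in-place additions, the "tempting" shortcut:
accum, running = [], start
for i in xrange(1000):
    accum.append(running)
    running += inc
print max([abs(a - d) for a, d in zip(accum, direct)])
# emits a small but nonzero error, on the order of 1e-13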
One variation you may want to consider is based on pre-computing the number of items that make up the bounded arithmetic progression:
import math

def frange1(start, end=None, inc=1.0):
    if end == None:
        end = start + 0.0   # Ensure a float value for 'end'
        start = 0.0
    nitems = int(math.ceil((end-start)/inc))
    for i in xrange(nitems):
        yield start + i * inc
This frange1 version may or may not be faster than the frange version shown in the solution; if the speed of this particular generator is crucial to your programs, it's best to try both versions and measure resulting times. In my limited benchmarking, on most of the hardware I have at hand, frange1 does appear to be consistently faster.
Talking about speed—believe it or not, looping with for i in itertools.count( ) is measurably faster than apparently obvious lower-level alternatives such as:
i = 0
while True:
    ...loop body unchanged...
    yield next
    i += 1
Do consider using itertools any time you want speed, and you may be in for more of these pleasant surprises.
If you work with floating-point numbers, you should definitely take a look at Numeric and other third-party extension packages that make Python such a powerful language for floating-point computations. For example, with Numeric, you could code something like:
import math, Numeric

def frange2(start, end=None, inc=1.0, typecode=None):
    if end == None:
        end = start + 0.0   # Ensure a float value for 'end'
        start = 0.0
    nitems = math.ceil((end-start)/inc)
    return Numeric.arange(nitems) * inc + start
This one is definitely going to be faster than both frange and frange1 if you need to collect all of the progression's items into a sequence.
Documentation for the xrange built-in function, and the itertools and math modules, in the Library Reference; Numeric Python (http://www.pfdubois.com/numpy/).
Credit: Tom Good, Steve Alexander
You have an iterable object x (it might be a sequence or any other kind of object on which you can iterate, such as an iterator, a file, a dict) and need a list object y, with the same items as x and in the same order.
When you know that iterable object x is bounded (so that, e.g., a loop for item in x would surely terminate), building the list object you require is trivial:
y = list(x)
However, when you know that x is unbounded, or when you are not sure, then you must ensure termination before you call list. In particular, if you want to make a list with no more than n items from x, then standard library module itertools' function islice does exactly what you need:
import itertools
y = list(itertools.islice(x, N))
Python's generators, iterators, and sundry other iterables are a wondrous thing, as this entire chapter strives to point out. The powerful and generic concept of iterable is a great way to represent all sorts of sequences, including unbounded ones, in ways that can potentially save you huge (and even infinite!) amounts of memory. With the standard library module itertools, generators you can code yourself, and, in Python 2.4, generator expressions, you can perform many manipulations on completely general iterables.
However, once in a while, you need to build a good old-fashioned full-fledged list object from such a generic iterable. For example, building a list is the simplest way to sort or reverse the items in the iterable, and lists have many other useful methods you may want to apply. As long as you know for sure that the iterable is bounded (i.e., has a finite number of items), just call list with the iterable as the argument, as the “Solution” points out. In particular, avoid the goofiness of misusing a list comprehension such as [i for i in x], when list(x) is faster, cleaner, and more readable!
Calling list won't help if you're dealing with an unbounded iterable. The need to ensure that some iterable x is bounded also arises in many other contexts, besides that of calling list(x): all “accumulator” functions (sum(x), max(x), etc.) intrinsically need a bounded-iterable argument, and so does a statement such as for i in x (unless you have appropriate conditional breaks within the loop's body), a test such as if i in x, and so on.
If, as is frequently the case, all you want is to ensure that no more than n items of iterable x are taken, then itertools.islice, as shown in the “Solution”, does just what you need. The islice function of the standard library itertools module offers many other possibilities (essentially equivalent to the various possibilities that slicing offers on sequences), but out of all of them, the simple “truncation” functionality (i.e., take no more than n items) is by far the most frequently used. The programming language Haskell, from which Python took many of the ideas underlying its list comprehension and generator expression functionalities, has a built-in take function to cater to this rather frequent need, and itertools.islice is most often used as an equivalent to Haskell's built-in take.
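As a small, purely illustrative example, here is islice playing the role of take on an unbounded iterator:
import itertools
squares = itertools.imap(lambda x: x*x, itertools.count( ))   # an unbounded iterator
print list(itertools.islice(squares, 5))    # emits: [0, 1, 4, 9, 16]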
In some cases, you cannot specify a maximum number of items, but you are able to specify a generic condition that you know will eventually be satisfied by the items of iterable x and can terminate the proceedings. itertools.takewhile lets you deal with such cases in a very general way, since it accepts the controlling predicate as a callable argument. For example:
y = list(itertools.takewhile((11).__cmp__, x))
binds name y to a new list made up of the sequence of items in iterable x up to, but not including, the first one that equals 11. (The reason we need to code (11).__cmp__ with parentheses is a somewhat subtle one: if we wrote 11.__cmp__ without parentheses, Python would parse 11. as a floating-point literal, and the entire construct would be syntactically invalid. The parentheses are included to force the tokenization we mean, with 11 as an integer literal and the period indicating an access to its attribute, in this case, bound method __cmp__.)
For the special and frequent case in which the terminating condition is the equality of an item to some given value, a useful alternative is to use the two-argument variant of the built-in function iter:
y = list(iter(iter(x).next, 11))
Here, the iter(x) call (which is innocuous if x is already an iterator) gives us an object on which we can surely access callable (bound method) next—which is necessary, because iter in its two-argument form requires a callable as its first argument. The second argument is the sentinel value, meaning the value that terminates the iteration as soon as an item equal to it appears. For example, if x were a sequence with items 1, 6, 3, 5, 7, 11, 2, 9, ..., y would now be the list [1, 6, 3, 5, 7]. (The sentinel value itself is excluded: from the beginning, included, to the end, excluded, is the normal Python convention for just about all loops, implicit or explicit.)
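Another classic use of the sentinel form of iter, sketched here with a hypothetical file name, is looping over a binary file in fixed-size blocks until read returns an empty string:
def read_blocks(path, blocksize=8192):
    f = open(path, 'rb')
    # iter's sentinel form: keep calling the callable until it returns ''
    for block in iter(lambda: f.read(blocksize), ''):
        yield block
    f.close( )

# typical use (the file name is hypothetical):
# for block in read_blocks('some_file.bin'):
#     print len(block)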
Library Reference documentation on built-ins list and iter, and module itertools.
Credit: Tom Good, Leandro Mariano Lopez
You want an unbounded generator that yields, one item at a time, the entire (infinite) sequence of Fibonacci numbers.
Generators are particularly suitable for implementing infinite sequences, given their intrinsically “lazy evaluation” semantics:
def fib( ):
    ''' Unbounded generator for Fibonacci numbers '''
    x, y = 0, 1
    while True:
        yield x
        x, y = y, x + y

if __name__ == "__main__":
    import itertools
    print list(itertools.islice(fib( ), 10))
# outputs: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
Generators make it quite easy to work with unbounded (infinite) sequences. In this recipe, we show a generator that produces all of the (infinitely many) Fibonacci numbers one after the other. (If you want the variant in which the sequence starts with 1, 1, 2, ..., rather than the one, implemented here, which starts with 0, 1, 1, ..., just interchange the two statements in the loop's body.)
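That is, the 1, 1, 2, ... variant is simply (a trivial sketch, with the two statements of the loop body swapped):
def fib1( ):
    ''' Unbounded generator for Fibonacci numbers, starting with 1, 1, 2, ... '''
    x, y = 0, 1
    while True:
        x, y = y, x + y
        yield x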
It's worth reflecting on why a generator is so perfectly suitable for implementing an unbounded sequence and letting you work with it. Syntactically, a generator is “just” a function containing the keyword yield. When you call a generator, however, the function body does not yet execute. Rather, calling the generator gives you a special anonymous iterator object that wraps the function's body, the function's local variables (including arguments, which, for any function, are local variables that happen to be initialized by the caller), and the current point of execution, which is initially the start of the function.
When you call this anonymous iterator object's next method, the function body executes up to the next yield statement. yield's argument is returned as the result of the iterator's next method, and the function is “frozen”, with its execution state intact. When you call next again on the same iterator object, the function “thaws” and continues from where it left off, again up to the next yield statement.
If the function body “falls off the end”, or executes a return, the iterator object raises StopIteration to indicate the end of the sequence. But, of course, if the sequence that the generator is producing is not bounded, the iterator never raises StopIteration. That's okay, as long as you don't rely on such an exception as the only way to terminate a loop. In this recipe, for example, the anonymous iterator object is passed as an argument to itertools.islice: as shown in Recipe 19.2, islice is the most typical way in which an unbounded iterator is made finite (truncated at an externally imposed boundary).
The main point to retain is that it’s all right to have infinite sequences represented by generators, since generators are computed lazily (in other words, each item gets computed just in time, when required), as long as some control structure ensures that only a finite number of items are required from the generator. The answer to our curiosity as to why generators are so excellently suitable for this use is in the anonymous iterator object which a generator returns when we call it: that anonymous iterator wraps some code (the generator’s function body) and some state (the function’s local variables, and, crucially, the point at which the function’s execution is to resume) in just the way that’s most convenient for the computation of most sequences, be they bounded or unbounded.
Leonardo Pisano (meaning “from Pisa”), most often called Leonardo Bigollo (the traveler or “the good for nothing”) during his lifetime in the 12th and 13th centuries, and occasionally Leonardo Fibonacci (for his connection to the Bonacci family), must look down with considerable pride from his place in the mathematicians' Empyrean. Although his most notable contributions were the introduction of decimal notation (Arabic numerals) in the West, and the codification of the rules for double-entry bookkeeping, these monumental achievements are not usually connected to his name. The one that is, however—from the third problem in his Liber Abaci, which he originally expressed in terms of a rabbit-raising farm—still provides interesting applications for the distant successors of the abacus, modern computers.
Recipe 19.2 shows how to make bounded iterators from unbounded (or “potentially unbounded”) ones.
Credit: Brett Cannon, Oren Tirosh, Alex Martelli
Python’s multiple unpacking assignment is very handy when you are unpacking all the items from a sequence and binding each to a name. However, you often need to unpack (and bind to names) only some items from the “front” of a sequence, and bind another name to “the rest” of the sequence (the part you didn’t unpack).
A generator provides an elegant solution to this problem:
def peel(iterable, arg_cnt=1):
    """ Yield each of the first arg_cnt items of the iterable, then
        finally an iterator for the rest of the iterable. """
    iterator = iter(iterable)
    for num in xrange(arg_cnt):
        yield iterator.next( )
    yield iterator

if __name__ == '__main__':
    t5 = range(1, 6)
    a, b, c = peel(t5, 2)
    print a, b, list(c)
# emits: 1 2 [3, 4, 5]
Python supports the handy idea of multiple unpacking assignment. Say that t5 is any sequence of five items. Then, you can code:
a, b, c, d, e = t5
to bind a name to each item of t5.
However, you often do not know (nor care) exactly how many items a certain sequence t holds: you want to bind (say) two names, one to each of the first two items, and a third name to “the rest” of t (this requirement does imply that t must hold at least two items).
If you know that t is a “proper” sequence, with support for slicing, not just an arbitrary iterable, you can code:
a, b = t[:2]
c = t[2:]
but this is nowhere as elegant or handy as the multiple unpacking assignment. Moreover, if you are not certain about the nature of t (i.e., if t can be any iterable, not necessarily supporting slice syntax), the task becomes more cumbersome:
c = iter(t)
a = c.next( )
b = c.next( )
Given these issues, the Python Development mailing list[1] once discussed a new syntax for generalized multiple unpacking assignment, such that:
a, b, *c = t
would perform exactly this task—bind names a and b to the first two items of t and name c to “the rest”.
I didn’t like the idea of making the Python language bigger by adding this extra functionality to assignment statements, so I came up with this recipe’s generator. This generator provides this functionality fully and without any need to add any new syntax to Python.
Just one caveat: you must make sure that you pass the arg_cnt argument properly. If you pass a wrong value for arg_cnt, or if the sequence you pass to peel is shorter than arg_cnt, you get an exception at runtime. But then, you also get a Python exception at runtime if you try to perform a multiple assignment and the number of names you have on the left of the = sign is not identical to the number of items of the sequence you have on the right. Therefore, this recipe isn't any different from normal, multiple unpacking assignment in this respect. If you think it is important to relax some parts of this requirement, see Recipe 19.5.
Language Reference and Python in a Nutshell about multiple unpacking assignments; Recipe 19.5.
Credit: Sami Hangaslammi, Peter Cogolo
You want to unpack (and bind to names) some items from the “front” of a sequence and bind another name to “the rest” of the sequence (the part you didn't unpack). You want to obtain the number of items to unpack automatically, based on how many names are on the left of the = sign in a multiple unpacking assignment.
The previous approach in Recipe 19.4 is clean and elegant, but you have to “manually” pass the number of items to unpack. If you’re willing to stoop to a little black magic, peering into stack frames and bytecodes, you may be able to bypass that requirement:
import inspect, opcode

def how_many_unpacked( ):
    f = inspect.currentframe( ).f_back.f_back
    if ord(f.f_code.co_code[f.f_lasti]) == opcode.opmap['UNPACK_SEQUENCE']:
        return ord(f.f_code.co_code[f.f_lasti+1])
    raise ValueError, "Must be a generator on RHS of a multiple assignment!"

def unpack(iterable):
    iterator = iter(iterable)
    for num in xrange(how_many_unpacked( )-1):
        yield iterator.next( )
    yield iterator

if __name__ == '__main__':
    t5 = range(1, 6)
    a, b, c = unpack(t5)
    print a, b, list(c)
While arguably spiffy, this recipe is a bit fragile, as you could well expect from a function relying on introspection on bytecode: while the recipe works in Python 2.3 and 2.4, any future release of Python might easily generate bytecode for a multiple unpacking assignment in a somewhat different way, and thus break the recipe.
Moreover, as presented, the recipe relies on how_many_unpacked being called specifically from a generator; if you call it from an ordinary function, it does not work, since in that case the UNPACK_SEQUENCE bytecode in the caller's caller happens to fall at offset f.f_lasti+3 instead of f.f_lasti.
For example, the following code doesn't work with the recipe's Solution because enumfunc is an ordinary function, not a generator:
def enumfunc( ):
    return xrange(how_many_unpacked( ))
a, b, c, d, e = enumfunc( )
However, the following code does work:
def enumgen( ):
    for x in xrange(how_many_unpacked( )):
        yield x
a, b, c, d, e = enumgen( )
because enumgen is a generator.
In other words, this recipe is a hack—arguably a neat hack (to the point that one of the editors of this Cookbook successfully lobbied the “other” two and managed to obtain the recipe’s inclusion in this volume), but, nevertheless, a hack. Therefore, you probably do not want to use this approach in “production code”, meaning code that must stay around for a long time and will be maintained across future versions of Python.
Nevertheless, you could make how_many_unpacked work in both contexts by making it a little bit more complicated:
def how_many_unpacked( ):
    f = inspect.currentframe( ).f_back.f_back
    bytecode = f.f_code.co_code
    ups_code = opcode.opmap['UNPACK_SEQUENCE']
    if ord(bytecode[f.f_lasti]) == ups_code:
        return ord(bytecode[f.f_lasti+1])
    elif ord(bytecode[f.f_lasti+3]) == ups_code:
        return ord(bytecode[f.f_lasti+4])
    else:
        raise ValueError, "Must be on the RHS of a multiple assignment!"
With this more complicated variant, how_many_unpacked would work when called from either a generator or an ordinary function. However, I recommend sticking with the simpler version presented in this recipe's Solution, and calling how_many_unpacked only from the given unpack generator, or a few other specific generators.
Even such a limited use can be considered debatable, since most Pythonistas prefer clarity and simplicity to the risky kind of “convenience” that can be obtained by such shortcuts. After all, this recipe's only advantage, in comparison to Recipe 19.4, is that you save yourself the trouble of passing to unpack the number of items required, which is nice, but clearly, not all that crucial.
Recipe 19.4; Language Reference and Python in a Nutshell about multiple unpacking assignments; Library Reference and Python in a Nutshell about library modules inspect and opcode.
Credit: Gyro Funch, Alex Martelli
You have an iterable p and need to get the n non-overlapping extended slices of stride n, which, if the iterable was a sequence supporting extended slicing, would be p[0::n], p[1::n], and so on up to p[n-1::n].
While extended slicing would return sequences of the same type we start with, it's much more sensible to specify a strider function that, instead, solves this problem by returning a list of lists:
def strider(p, n):
    """ Split an iterable p into a list of n sublists, repeatedly taking the
        next element of p and adding it to the next sublist.  Example:
        >>> strider('abcde', 3)
        [['a', 'd'], ['b', 'e'], ['c']]
        In other words, strider's result is equal to:
            [list(p[i::n]) for i in xrange(n)]
        if iterable p is a sequence supporting extended-slicing syntax.
    """
    # First, prepare the result, a list of n separate lists
    result = [ [ ] for x in xrange(n) ]
    # Loop over the input, appending each item to one of
    # result's lists, in "round robin" fashion
    for i, item in enumerate(p):
        result[i % n].append(item)
    return result
The function in this recipe takes an iterable p and pulls it apart into a user-defined number n of pieces (specifically, function strider returns a list of sublists), distributing p's items into what would be the n extended slices of stride n if p were a sequence.
If we were willing to sacrifice generality, forcing argument p to be a sequence supporting extended slicing, rather than a generic iterable, we could use a very different approach, as the docstring of strider indicates:
def strider1(p, n):
    return [list(p[i::n]) for i in xrange(n)]
Depending on our exact needs, with such a strong constraint on p, we might omit the list call to make each subsequence into a list, and/or code a generator to avoid consuming extra memory to materialize the whole list of results at once:
def strider2(p, n):
    for i in xrange(n):
        yield p[i::n]
or, equivalently:
import itertools
def strider3(p, n):
    return itertools.imap(lambda i: p[i::n], xrange(n))
or, in Python 2.4, with a generator expression:
def strider4(p, n):
    return (p[i::n] for i in xrange(n))
However, none of these alternatives accepts a generic iterable as p—each demands a full-fledged sequence.
Back to this recipe's exact specs, the best way to enhance the recipe is to recode it to avoid low-level fiddling with indices. While doing arithmetic on indices is conceptually quite simple, it can get messy and indeed is notoriously error prone. We can do better by a generous application of module itertools from the Python Standard Library:
import itertools
def strider5(p, n):
    result = [ [ ] for x in itertools.repeat(0, n) ]
    resiter = itertools.cycle(result)
    for item, sublist in itertools.izip(p, resiter):
        sublist.append(item)
    return result
This strider5 version uses three functions from module itertools—all of the functions in module itertools return iterable objects, and, as we see in this case, their results are therefore typically used in for loops. Function repeat yields an object, repeatedly, a given number of times, and here we use it instead of the built-in function xrange to control the list comprehension that builds the initial value for result. Function cycle takes an iterable object and returns an iterator that walks over that iterable object repeatedly and cyclically—in other words, cycle performs exactly the round-robin effect that we need in this recipe. Function izip is essentially like the built-in function zip, except that it returns an iterator and thus avoids the memory-consumption overhead that zip incurs by building its whole result list in memory at once.
This version achieves deep elegance and conceptual simplicity (although you may need to gain some familiarity with itertools before you agree that this version is simple!) by foregoing all index arithmetic and leaving all of the handling of the round-robin issues to itertools.cycle. resiter, per se, is a nonterminating iterator, but the function deals effortlessly with that. Specifically, since we use resiter together with p as arguments to izip, termination is assured (assuming, of course, that p does terminate!) by the semantics of izip, which, just like built-in function zip, stops iterating as soon as any one of its arguments is exhausted.
The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python's online documentation; the Library Reference and Python in a Nutshell docs about the built-ins zip and xrange, and extended-form slicing of sequences.
Credit: Peter Cogolo, Steven Bethard, Ian Bicking
You have an iterable s and need to make another iterable whose items are sublists (i.e., sliding windows), each of the same given length, over s' items, with successive windows overlapping by a specified amount.
We can combine built-in function iter and function islice from the standard library module itertools to code a generator to solve our problem:
import itertools

def windows(iterable, length=2, overlap=0):
    it = iter(iterable)
    results = list(itertools.islice(it, length))
    while len(results) == length:
        yield results
        results = results[length-overlap:]
        results.extend(itertools.islice(it, length-overlap))
    if results:
        yield results

if __name__ == '__main__':
    seq = 'foobarbazer'
    for length in (3, 4):
        for overlap in (0, 1):
            print '%d %d: %s' % (length, overlap,
                                 map(''.join, windows(seq, length, overlap)))
This module, when run as a main script, emits:
3 0: ['foo', 'bar', 'baz', 'er']
3 1: ['foo', 'oba', 'arb', 'baz', 'zer', 'r']
4 0: ['foob', 'arba', 'zer']
4 1: ['foob', 'barb', 'baze', 'er']
When you know you don’t need any overlap, a fast and concise alternative is available:
def chop(iterable, length=2):
    return itertools.izip(*(iter(iterable),)*length)
The body of this concise alternative may be a bit confusing until you realize that the two occurrences of the asterisk (*) there play different roles: the first one is part of a *args syntax form (passing the elements of a sequence as separate positional arguments), the second one indicates that a sequence (the singleton tuple (iter(iterable),)) must be repeated length times.
In many cases, we need a sequence of sub-sequences of a given length, and we have to start with a “flat” iterable. For example, we can build a dictionary with given keys and values by calling dict with a sequence of two-item sequences—but what if we start with a “flat” sequence where keys and values just alternate? The function windows in this recipe meets this need:
the_dict = dict(windows(flat_alternating_keys_and_values))
Or, say we have an iterable whose items are the amounts of sales on each day. To turn it into an iterable whose items are the amounts of sales in each week (seven days):
weekly_sales = itertools.imap(sum, windows(daily_sales, 7))
The two use cases just presented are examples of how windows can be useful when called without overlap (in other words, with an overlap argument of 0, its default value), so the alternative chop function also presented in the recipe would be just as good (and faster). However, overlap is often useful when you deal with iterables that are signals, or time series. For example, if you have a function average such as:
def average(sequence):
    return sum(sequence)/float(len(sequence))
then you can apply a simple low-pass filter to a signal:
filtered = itertools.imap(average, windows(raw_signal, 5, 2))
or get the moving average daily sales from the iterable of daily sales:
mvavg_daily_sales = itertools.imap(average, windows(daily_sales, 7, 6))
The implementation of the windows generator in this recipe is quite straightforward, if you're familiar with itertools.islice (and you should be, if you deal with iterables!). For the first “window”, we must clearly fill list results with the appropriate number of items (islice does that for us). At each subsequent step, we must throw away a certain number of items from the “front” of results (we do that conveniently by list slicing, since results is, indeed, a list) and replenish the same number at the back (we do that by calling the extend method of results, with islice providing the needed “new” items). That number, as a little reasoning shows, is exactly that given by the expression length-overlap. The loop terminates, if ever, only when results cannot be entirely replenished. (The loop never executes if results cannot even be filled entirely in the first place.)
When the loop terminates, we may be left with a “tail” in results, a “last window” that is shorter than length. What to do in that case depends on your application's exact needs. The recipe, as given above, just yields the shorter window as the last item of the generator, which is what we'd probably want in all of the previously mentioned use cases. In other cases, we might want to drop the last, too-short window (just omit the last two statements in function windows as given in the recipe), raise an exception (when we know that such a situation should never occur), or pad the last window to the required length with a pad value such as None, by changing the last two statements in function windows to something like:
if results:
    results.extend(itertools.repeat(None, length-len(results)))
    yield results
One important implementation detail: function windows, as coded in the recipe, yields a new list object at each step. It takes some time to generate all of these objects. In some cases, it may be convenient to the caller to know that each object it gets is a separate and independent list. Such knowledge enables the caller to store or modify the objects it gets, without having to make explicit copies. However, none of the use cases we discussed gets any benefit from this feature. So, you could optimize, by yielding the same list object every time. If you want that optimization, just change the statement:
results = results[length-overlap:]
into:
del results[:length-overlap]
If you're applying this optimization, and you're using Python 2.4, you should also consider using the new type collections.deque instead of list. In order to do so, you need to add the statement:
import collections
at the start of your code, change the only occurrence of list in the recipe into collections.deque, and further change the updating of results to avoid slicing, using, instead:
for i in xrange(length-overlap):
    results.popleft( )
If your windows are long, and if they overlap substantially, using deque instead of list might give you better performance, since deque is optimized to support adding and removing items at either end, while lists, being compact arrays in memory, are inherently slow (specifically, O(n) for a list of length n) when you add or remove items at the beginning.
When you want windows of some length n that overlap specifically by n-1 items, function itertools.tee, new in Python 2.4, offers an elegant alternative approach. Say that you want to look at each item of the iterable, with access to a few neighboring items and some padding at both ends, so that you get just as many windows as there are items in the iterable. In Python 2.4, you could then code:
import itertools as IT

def windowed(iterable, pre=1, post=1, padding=None):
    # tee-off one iterator for each index in the window
    copies = IT.tee(iterable, pre + 1 + post)
    pre_copies, copy, post_copies = copies[:pre], copies[pre], copies[pre+1:]
    # iterators before the element have their start filled in with the
    # padding value.  no need to slice off the ends, izip will do that.
    pre_copies = [IT.chain(IT.repeat(padding, pre - i), itr)
                  for i, itr in enumerate(pre_copies)]
    # iterators after the element have their starts sliced off and their
    # end filled in with the padding value, endlessly repeated.
    post_copies = [IT.chain(IT.islice(itr, i + 1, None), IT.repeat(padding))
                   for i, itr in enumerate(post_copies)]
    # zip the elements with their preceding and following elements
    return IT.izip(*(pre_copies + [copy] + post_copies))
For example:
>>> print list(windowed(xrange(4), 1, 2, 'x'))
[('x', 0, 1, 2), (0, 1, 2, 3), (1, 2, 3, 'x'), (2, 3, 'x', 'x')]
If you use Python 2.4 and want this flavor of “sliding windows” over the iterable, with specified “padding” at both ends, you might prefer this windowed function to the recipe's windows generator.
Library Reference documentation on built-in iter and module itertools.
Credit: Andy McKay, Hamish Lawson, Corey Coughlin
You need to loop through every item of multiple iterables in parallel, meaning that you first want to get a tuple with all of the first items of each iterable, next, a tuple with all of the “second items”, and so forth.
Say you have two iterables (lists, in this case) such as:
a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']
If you want to loop “in parallel” over them, the most general and effective approach is:
import itertools
for x, y in itertools.izip(a, b):
    print x, y
This snippet outputs two lines:
a1 b1
a2 b2
The most general and effective way to loop “in parallel” over multiple iterables is to use function izip of standard library module itertools, as shown in the “Solution”. The built-in function zip is an alternative that is almost as good:
for x, y in zip(a, b):
    print x, y
However, zip has one downside that can hurt your performance if you're dealing with long sequences: it builds the list of tuples in memory all at once (preparing and returning a list), while you need only one tuple at a time for pure looping purposes.
Both zip and itertools.izip, when you iterate in parallel over iterables of different lengths, stop as soon as the “shortest” such iterable is exhausted. This approach to termination is normally what you want. For example, it lets you have one or more non-terminating iterables in the zipping, as long as at least one of the iterables does terminate—or (in the case of izip, only) as long as you use some control structure, such as a conditional break within a for statement, to ensure you always require a finite number of items and do not loop endlessly.
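For example (a small, purely illustrative sketch), itertools.count is a nonterminating iterable, yet it can safely take part in an izip, either because the other iterable ends or because a conditional break stops the loop:
import itertools
b = ['b1', 'b2']
# terminates because b is exhausted after two items:
for i, y in itertools.izip(itertools.count( ), b):
    print i, y
# terminates only thanks to the explicit break:
for i, y in itertools.izip(itertools.count( ), itertools.cycle(b)):
    if i >= 4:
        break
    print i, y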
In some cases, when iterating in parallel over iterables of different lengths, you may want shorter iterables to be conceptually “padded” with None up to the length of the longest iterable in the zipping. For this special need, you can use the built-in function map with a first argument of None:
for x, y in map(None, a, b):
    print x, y
map, like zip, builds and returns a whole list. If that is a problem, you can reproduce map's pad-with-None behavior by coding your own generator. Coding your own generator is also a good approach when you need to pad shorter iterables with some value that is different from None.
If you need to deal only with specifically two sequences, your iterator’s code can be quite straightforward and linear:
import itertools

def par_two(a, b, padding_item=None):
    a, b = iter(a), iter(b)
    # first, deal with both iterables via izip until one is exhausted:
    for x in itertools.izip(a, b):
        yield x
    # only one of the following two loops, at most, will execute, since
    # either a or b (or both!) are exhausted at this point:
    for x in a:
        yield x, padding_item
    for x in b:
        yield padding_item, x
Alternatively, you can code a more general function, one that is able to deal with any number of sequences:
import itertools

def par_loop(padding_item, *sequences):
    iterators = map(iter, sequences)
    num_remaining = len(iterators)
    result = [padding_item] * num_remaining
    while num_remaining:
        for i, it in enumerate(iterators):
            try:
                result[i] = it.next( )
            except StopIteration:
                iterators[i] = itertools.repeat(padding_item)
                num_remaining -= 1
                result[i] = padding_item
        if num_remaining:
            yield tuple(result)
Here's an example of use for generator par_loop:
print map(''.join, par_loop('x', 'foo', 'zapper', 'ui'))
# emits: ['fzu', 'oai', 'opx', 'xpx', 'xex', 'xrx']
Both par_two and par_loop start by calling the built-in function iter on all of their arguments and thereafter use the resulting iterators. This is important, because the functions rely on the state that these iterators maintain. The key idea in par_loop is to keep count of the number of iterators as yet unexhausted, and replace each exhausted iterator with a nonterminating iterator that yields the padding_item ceaselessly; num_remaining counts unexhausted iterators, and both the yield statement and the continuation of the while loop are conditional on some iterators being as yet unexhausted.
Alternatively, if you know in advance which iterable is the longest one, you can wrap every other iterable x as itertools.chain(iter(x), itertools.repeat(padding)) and then call itertools.izip. You can't do this wrapping on all iterables because the resulting iterators are nonterminating—if you izip iterators that are all nonterminating, izip itself cannot terminate! Here, for example, is a version that works as intended only when the longest (but terminating!) iterable is the very first one:
import itertools

def par_longest_first(padding_item, *sequences):
    iterators = map(iter, sequences)
    for i, it in enumerate(iterators):
        if not i:
            continue
        iterators[i] = itertools.chain(it, itertools.repeat(padding_item))
    return itertools.izip(*iterators)
The itertools module is part of the Python Standard Library and is documented in the Library Reference portion of Python's online documentation; the Library Reference and Python in a Nutshell docs about built-ins zip, iter, and map.
Credit: Attila Vàsàrhelyi, Raymond Hettinger, Steven Taschuk
You need to loop through every item of multiple iterables cross-productwise, meaning that you first want to get the first item of the first iterable paired with all the others, next, the second item of the first iterable paired with all the others, and so forth.
Say you have two iterables (lists, in this case) such as:
a = ['a1', 'a2', 'a3']
b = ['b1', 'b2']
If you want to loop over their cross-product, the simplest approach is often just a couple of nested for loops:
for x in a:
    for y in b:
        print x, y
This snippet’s output is six lines:
a1 b1
a1 b2
a2 b1
a2 b2
a3 b1
a3 b2
However, in many cases, you'd rather get all items in the “cross-product” of multiple iterables as a single, linear sequence, suitable for using in a single for loop or for passing onwards to other sequence manipulation functions, such as those supplied by itertools. For such needs, you may put the nested fors in a list comprehension:
for x, y in [(x,y) for x in a for y in b]:
    print x, y
A list comprehension lets you easily generate (as a single, linear sequence) all the pairings of several iterables (also known as the cross-product, product set, or Cartesian product of these iterables). However, the number of items in such a cross-product is the arithmetic product (multiplication) of the lengths of all the iterables involved, a number that may easily get quite large. A list comprehension, by definition, builds the entire list at once, which means that it may consume substantial amounts of memory. Also, you get to start iterating only when the whole cross-product list is entirely built.
Python 2.4 offers one obvious way to solve this problem: the newly introduced construct of generator expressions:
for x, y in ((x,y) for x in a for y in b):
    print x, y
A generator expression looks just like a list comprehension, except that it uses parentheses rather than brackets: it returns an iterator, suitable for looping on, rather than building and returning a list. Thus, a generator expression can save substantial amounts of memory, if you are iterating over a very long sequence. Also, you start executing the loop's body very soon, since each successive element gets generated iteratively, before each iteration of the loop's body. If your loop's body contains conditional breaks, so that execution terminates as soon as some conditions are met, using a generator expression rather than a list comprehension can mean a potentially substantial improvement in performance.
If you need to support Python 2.3, and yet you want to achieve the kind of advantages that generator expressions can afford over list comprehensions, the best approach may be to code your own generator. This is quite simple if you only need to deal with a known number of sequences, such as two:
def cross_two(a, b):
    for x in a:
        for y in b:
            yield x, y
Dealing with an arbitrary number of sequences is a bit more complicated, but not terribly so, particularly if we use recursion to help:
def cross_loop(*sequences):
    if sequences:
        for x in sequences[0]:
            for y in cross_loop(*sequences[1:]):
                yield (x,) + y
    else:
        yield ( )
We can also do it without recursion. It’s not hard if we’re willing to build the entire result list in memory at once before returning it, just as a list comprehension would:
def cross_list(*sequences):
    result = [[ ]]
    for seq in sequences:
        result = [sublist+[item] for sublist in result for item in seq]
    return result
Alternatively, you can return map(tuple, result) if you need to ensure that each item of the sequence you return is a tuple, not a list.
Recursion-free iterative (incremental) generation of the “cross-product” sequence is also feasible, even though it’s nowhere as simple as either the recursive or the nonincremental versions:
def cross(*sequences):
    # visualize an odometer, with "wheels" displaying "digits"...:
    wheels = map(iter, sequences)
    digits = [it.next( ) for it in wheels]
    while True:
        yield tuple(digits)
        for i in range(len(digits)-1, -1, -1):
            try:
                digits[i] = wheels[i].next( )
                break
            except StopIteration:
                wheels[i] = iter(sequences[i])
                digits[i] = wheels[i].next( )
        else:
            break
In Python 2.4, you might express the for statement more clearly as for i in reversed(range(len(digits))).
To repeat, it is important to remember that all of these solutions should be considered only if you do have the problem—that is, if and only if you do need to view all items in the “cross-product” of multiple iterables as a single, linear sequence. Many cases have no such requirement, and simply coding multiple nested for loops inline is quite acceptable, simpler, and more readable. In many cases, getting all items in the “cross-product” as a single sequence is preferable, so it's worth knowing how to do that. However, do keep in mind that simplicity is an important virtue, and do not lose sight of it in pursuit of a cool (but complicated) solution. All the cool tools, constructs, and library modules that Python offers exist strictly to serve you, to let you build and maintain your applications with minimal effort. Don't go out of your way to use the new shiny tools if you can solve your application's problems with less effort in simpler ways!
The Library Reference and Python in a Nutshell docs about built-ins iter, enumerate, map, and (Python 2.4 only) reversed; the Language Reference and Python in a Nutshell docs about list comprehensions and (Python 2.4 only) generator expressions.
Credit: Alex Martelli, Magnus Lie Hetland, Terry Reedy
You need to read a text file (or any other iterable whose items are lines of text) paragraph by paragraph, where a “paragraph” is defined as a sequence of nonwhite lines (i.e., paragraphs are separated by lines made up exclusively of whitespace).
A generator is quite suitable for bunching up lines this way:
def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    paragraph = [ ]
    for line in lines:
        if is_separator(line):
            if paragraph:
                yield joiner(paragraph)
                paragraph = [ ]
        else:
            paragraph.append(line)
    if paragraph:
        yield joiner(paragraph)

if __name__ == '__main__':
    text = 'a first paragraph and a second one '
    for p in paragraphs(text.splitlines(True)):
        print repr(p)
Python doesn't directly support paragraph-oriented file reading, but it's not hard to add such functionality. We define a “paragraph” as the string formed by joining a nonempty sequence of nonseparator lines, separated from any adjoining paragraphs by nonempty sequences of separator lines. A separator line is one that satisfies the predicate passed in as argument is_separator. (A predicate is a function whose result is taken as a logical truth value, and we say a predicate is satisfied when the predicate returns a result that is true.) By default, a line is a separator if it is made up entirely of whitespace characters (e.g., space, tab, newline, etc.).
The recipe's code is quite straightforward. The state of the generator during iteration is entirely held in local variable paragraph, a list to which we append the nonseparator lines that make up the current paragraph. Whenever we meet a separator in the body of the for statement, we test if paragraph to check whether the list is currently empty. If the list is empty, we're already skipping a run of separators and need do nothing special to handle the current separator line. If the list is not empty, we've just met a separator line that terminates the current paragraph, so we must join up the list, yield the resulting paragraph string, and then set the list back to empty.
This recipe implements a special case of sequence adaptation by bunching: an underlying iterable is “bunched up” into another iterable with “bigger” items. Python's generators let you express sequence adaptation tasks very directly and linearly. By passing as arguments, with reasonable default values, the is_separator predicate, and the joiner callable that determines what happens to each “bigger item” when we're done bunching it up, we achieve a satisfactory amount of generality without any extra complexity. To see this, consider a snippet such as:
import operator
numbers = [1, 2, 3, 0, 0, 6, 5, 3, 0, 12]
bunch_up = paragraphs
for s in bunch_up(numbers, operator.not_, sum):
    print 'S', s
for l in bunch_up(numbers, bool, len):
    print 'L', l
In this snippet, we use the paragraphs generator (under the name of bunch_up, which is clearer in this context) to get the sums of “runs” of nonzero numbers separated by runs of zeros, then the lengths of the runs of zeros—applications that, at first sight, might appear to be quite different from the recipe's stated purpose. That's the magic of abstraction: when appropriately and tastefully applied, it can easily turn the solution of a problem into a family of solutions for many other apparently unrelated problems.
An elementary issue, but a crucial one for getting good
performance in the “main” use case of this recipe, is that the
paragraphs
' generator builds up each resulting
paragraph as a list of strings, then concatenates all strings in the
list with ''.join
to obtain each
result it yield
s. An alternate
approach, where a large string is built up as a string, by repeated
application of +=
or +
, is never the right approach in Python: it
is both slow and clumsy. Good Pythonic style absolutely
demands that we use a list as the intermediate
accumulator, whenever we are building a long string by concatenating a
number of smaller ones. Python 2.4 has diminished the performance
penalty of the wrong approach. For example, to join a list of 52
one-character strings into a 52-character string on my machine, Python
2.3 takes 14.2 microseconds with the right approach, 73.6 with the
wrong one; but Python 2.4 takes 12.7 microseconds with the right
approach, 41.6 with the wrong one, so the penalty in this case has
decreased from over five times to over three. Nevertheless, there is no reason to choose to pay such a performance penalty without any returns, even the lower penalty that Python 2.4 manages to exact!
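If you want to check the penalty on your own machine and Python version, a small timing sketch along the following lines (using only the standard timeit module; the exact numbers are, of course, illustrative) reproduces the comparison:

import timeit
setup = "import string; chars = list(string.ascii_letters)"
stmt_join = "''.join(chars)"
stmt_plus = "s = ''\nfor c in chars: s += c"
for label, stmt in ('join', stmt_join), ('+=  ', stmt_plus):
    best = min(timeit.Timer(stmt, setup).repeat(3, 10000))
    # best is the total time for 10000 runs; divide to get the per-join cost
    print label, best / 10000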
Python 2.4 offers a new itertools.groupby
function that is quite
suitable for sequence-bunching tasks. Using it, we could express the
paragraphs
generator in a really tight and concise
way:
from itertools import groupby

def paragraphs(lines, is_separator=str.isspace, joiner=''.join):
    for separator_group, lineiter in groupby(lines, key=is_separator):
        if not separator_group:
            yield joiner(lineiter)
itertools.groupby
, like SQL’s
GROUP BY
clause, which inspired it,
is not exactly trivial to use, but it can be quite useful indeed for
sequence-bunching tasks once you have mastered it thoroughly.
Recipe 19.11;
Chapter 1 for general issues
about handling text; Chapter
2 for general issues about handling files; Recipe 19.21;
Library Reference documentation on Python 2.4’s
itertools.groupby
.
Credit: Alex Martelli
You have a file that includes long logical lines split over two or more physical lines, with backslashes to indicate that a continuation line follows. You want to process a sequence of logical lines, “rejoining” those split lines.
As usual, our first idea for a problem involving sequences should be a generator:
def logical_lines(physical_lines, joiner=''.join):
    logical_line = []
    for line in physical_lines:
        stripped = line.rstrip()
        if stripped.endswith('\\'):
            # a line which continues w/the next physical line
            logical_line.append(stripped[:-1])
        else:
            # a line which does not continue, end of logical line
            logical_line.append(line)
            yield joiner(logical_line)
            logical_line = []
    if logical_line:
        # end of sequence implies end of last logical line
        yield joiner(logical_line)

if __name__ == '__main__':
    text = 'some\\\n', 'lines\\\n', 'get\n', 'joined\\\n', 'up\n'
    for line in text:
        print 'P:', repr(line)
    for line in logical_lines(text, ' '.join):
        print 'L:', repr(line)
When run as a main script, this code emits:
P: 'some\\\n'
P: 'lines\\\n'
P: 'get\n'
P: 'joined\\\n'
P: 'up\n'
L: 'some lines get\n'
L: 'joined up\n'
This problem is about sequence-bunching, just like the previous Recipe 19.10. It is therefore not surprising that this recipe, like the previous, is a generator (with an internal structure quite similar to the one in the “other” recipe): today, in Python, sequences are often processed most simply and effectively by means of generators.
In this recipe, the generator can encompass just a small amount
of generality without introducing extra complexity. Determining whether a line is a continuation line, and how to proceed when it is, is slightly too idiosyncratic to generalize in a simple and transparent way. I have therefore chosen to code that functionality
inline, in the body of the logical_lines
generator,
rather than “factoring it out” into separate callables. Remember,
generality is good, but simplicity is even more important. However, I
have kept the simple and transparent generality obtained by passing
the joiner
function as an argument, and the snippet
of code under the if __name__ == '__main__' test demonstrates how we may want to use that generality,
for example, to join continuation lines with a space rather than with
an empty string.
If you are certain that the file you’re processing is
sufficiently small to fit comfortably in your computer’s memory, with
room to spare for processing, and you don’t need
the feature (offered in the version of logical_lines
shown in the “Solution”) of ignoring whitespace to the right of a
terminating \
, a solution using a
plain function rather than a generator is simpler than the one shown
in this recipe’s Solution:
def logical_lines(physical_lines, joiner=''.join, separator=''):
    return joiner(physical_lines).replace('\\\n', separator).splitlines(True)
In this variant, we join all of the physical lines into one long
string, then we replace the “canceled” line ends (line ends
immediately preceded by a backslash) with nothing (or any other
separator we’re requested to use), and finally split the resulting
long string back into lines (keeping the line ends—that’s what the
True
argument to method splitlines
is for). This approach is a very
different one from that suggested in this recipe but possibly
worthwhile, if physical_lines
is small enough that
you can afford the memory for it. I prefer the Solution’s approach
because giving semantic significance to trailing whitespace is a poor
user interface design choice.
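To make that difference concrete, here is a small check you might run, assuming the generator version from the Solution is available as logical_lines_gen and the one-shot variant above as logical_lines_flat (hypothetical names, used here only to keep the two apart):

lines = ['keep \\  \n', 'together\n']    # note the trailing spaces after the backslash
print list(logical_lines_gen(lines))     # ['keep together\n']: the backslash still counts
print logical_lines_flat(lines)          # ['keep \\  \n', 'together\n']: no continuation seen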
Recipe 19.10; Perl Cookbook recipe 8.1; Chapter 1 for general issues about handling text; Chapter 2 for general issues about handling files.
Credit: Scott David Daniels, Peter Cogolo
You want to loop over all lines of a stream, but the stream arrives as a sequence of data blocks of arbitrary size (e.g., from a network socket).
We need to code a generator that gets blocks and yields lines:
def ilines(source_iterable, eol='\r\n', out_eol='\n'):
    tail = ''
    for block in source_iterable:
        pieces = (tail+block).split(eol)
        tail = pieces.pop()
        for line in pieces:
            yield line + out_eol
    if tail:
        yield tail

if __name__ == '__main__':
    s = 'one\r\ntwo\r\n,three,four,five\r\n,six,\r\nseven\r\nlast'.split(',')
    for line in ilines(s):
        print repr(line)
When run as a main script, this code emits:
'one\n'
'two\n'
'threefourfive\n'
'six\n'
'seven\n'
'last'
Many data sources produce their data in fits and starts—sockets, RSS feeds, the results of expanding compressed text, and (at its heart) most I/O. The data often doesn’t arrive at convenient boundaries, but you nevertheless want to consume it in logical units. For text, the logical units are often lines.
This recipe shows generator ilines
, a simple
way to consume a source_iterable
, which yields blocks
of data, producing an iterator that yields lines of text instead.
ilines
is vastly simplified by assuming that lines
are separated, on input, by a known end-of-line (EOL) string—by
default '\r\n', which is the standard EOL marker in
most Internet protocols. ilines
' implementation is
further simplified by taking a high-level approach, relying on the
split
method of Python’s string
types to do most of the work. This basically leaves
ilines
with the single task of “buffering” data
between successive input blocks, on all occasions when a line starts
in one block and ends in a following one (including those occasions in
which block boundaries “split” an EOL marker).
ilines
easily accomplishes its buffering task
through its local variable tail
, which
starts empty and, at each leg of the loop, holds that which followed
the latest EOL marker seen so far. When tail+block
ends with an EOL marker, the
expression (tail+block).split(eol)
produces a list whose last item is an empty string (''), exactly what
we need; otherwise, the last item of the list is that which followed
the last EOL, which again is exactly what we
need.
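For instance, here is a tiny check (an illustrative input of my own, assuming the ilines generator defined above) in which one '\r\n' marker is split across two blocks; the tail buffer stitches the line back together:

blocks = ['first\r', '\nsecond\r\nthi', 'rd\r\n']
print list(ilines(blocks))
# emits ['first\n', 'second\n', 'third\n']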
Python’s built-in file
objects are even more powerful than ilines
, since
they support a universal newlines reading mode
(mode 'U
'), which is able to
recognize and deal with all common EOL markers (even when different
markers are mixed within the same stream!). However,
ilines
is more flexible, since you may apply it in
many situations where you have a stream of arbitrary blocks of text
and want to process it as a stream of lines, with a known EOL
marker.
Library Reference and Python
in a Nutshell docs about built-in file
objects; Chapter 2 for general issues about
handling files.
Credit: Christopher Prinos
You want to fetch a result set from a database (using
the Python DB API) and easily iterate over each record in the result
set. However, you don’t want to use the DB cursor’s method fetchall
: it could consume a lot of memory
and would force you to wait until the whole result set comes back
before you can start iterating.
A generator is the ideal solution to this problem:
def fetchsome(cursor, arraysize=1000):
    ''' A generator that simplifies the use of fetchmany '''
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
In applications that use the Python DB API, you often see code
that goes somewhat like (where cursor
is a DB API
cursor object):
cursor.execute('select * from HUGE_TABLE')
for result in cursor.fetchall():
    doSomethingWith(result)
This simple approach is “just” fine, as long as fetchall
returns a small result set, but it
does not work very well if the query result is very large. A large
result set can take a long time to return. Also, cursor.fetchall( )
needs to allocate enough
memory to store the entire result set in memory at once. Further, with
this simple approach, the doSomethingWith
function
isn’t going to get called until the entire query’s result finishes
coming over from the database to our program.
An alternative approach is to rely on the cursor.fetchone
method:
for result in iter(cursor.fetchone, None):
    doSomethingWith(result)
However, this alternative approach does not allow the database
to optimize the fetching process: most databases can exhibit better
efficiency when returning multiple records for a single query, rather
than returning records one at a time as fetchone
requires.
To let your applications obtain greater efficiency than fetchone
allows, without the risks of
unbounded memory consumption and delay connected to the use of
fetchall
, Python’s DB API’s cursors
also have a fetchmany
method.
However, the direct use of fetchmany
makes your iterations somewhat
more complicated than the simple for
statements such as those just shown. For
example:
while True:
    results = cursor.fetchmany(1000)
    if not results:
        break
    for result in results:
        doSomethingWith(result)
Python’s generators are a great way to encapsulate complicated
iteration logic so that application code can just about always loop
with simple for
statements. With
this recipe’s fetchsome
generator, you get the same
efficiencies and safeguards as with the native use of the fetchmany
method in the preceding snippet
but with the same crystal-clear simplicity as in the snippets that
used either fetchall
or fetchone
, namely:
for result in fetchsome(cursor):
    doSomethingWith(result)
By default, fetchsome
fetches up to 1,000
records at a time, but you can change that number, depending on your
requirements. Optimal values can depend on the schema, the database type, and the choice of Python DB API module. In general, you’re best advised to
experiment with a few different values in your specific settings if
you need to optimize this specific aspect. (Such experimentation is
often a good idea for any optimization task.)
This recipe is clearly an example of a more general case: a subsequence unbuncher generator that you can use when you have a sequence of subsequences (each subsequence being obtained through some call, and the end of the whole sequence being indicated by an empty subsequence) and want to flatten it into a simple, linear sequence of items. You can think of this unbunching task as the reverse of the sequence-bunching tasks covered earlier in Recipe 19.10 and Recipe 19.11, or as a simpler variant of the sequence-flattening task covered in Recipe 4.6. A generator for unbunching might be:
def unbunch(next_subseq, *args):
    ''' un-bunch a sequence of subsequences into a linear sequence '''
    while True:
        subseq = next_subseq(*args)
        if not subseq:
            break
        for item in subseq:
            yield item
As you can see, the structure of unbunch
is
basically identical to that of the recipe’s
fetchsome
. Usage would also be just about the
same:
for result in unbunch(cursor.fetchmany, 1000):
    doSomethingWith(result)
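The same unbunch pattern applies outside databases, too. For example (a sketch of my own, not part of the recipe), a file object's readlines method accepts a size hint and returns an empty list at end of file, so it fits unbunch's calling convention and lets you read a large file in bounded-size batches of lines:

f = open('huge_log.txt')           # hypothetical file name
for line in unbunch(f.readlines, 64*1024):
    doSomethingWith(line)
f.close()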
However, while it is important and instructive to consider this
kind of generalization, when you’re writing applications you’re often
better off using specific generators that directly deal with your
application’s specific needs. In this case, for example, calling
fetchsome(cursor)
is more obvious
and direct than calling unbunch(cursor.fetchmany, 1000)
, and
fetchsome
usefully hides the usage of fetchmany
as well as the specific choice of
1,000
as the subsequence size to
fetch at each step.
Recipe 19.10; Recipe 19.11; Recipe 4.6; Python’s DB API is covered in Chapter 7 and in Python in a Nutshell.
Credit: Sébastien Keim, Raymond Hettinger, Danny Yoo
You have several sorted sequences (iterables) and need to iterate on the overall sorted sequence that results from “merging” these sequences.
A generator is clearly the right tool for the job, in the
general case (i.e., when you might not have enough memory to
comfortably hold all the sequences). Implementing the generator is
made easy by the standard library module heapq
, which supplies functions to implement
the “heap” approach to priority queues:
import heapq

def merge(*subsequences):
    # prepare a priority queue whose items are pairs of the form
    # (current-value, iterator), one each per (non-empty) subsequence
    heap = []
    for subseq in subsequences:
        iterator = iter(subseq)
        for current_value in iterator:
            # subseq is not empty, therefore add this subseq's pair
            # (current-value, iterator) to the list
            heap.append((current_value, iterator))
            break
    # make the priority queue into a heap
    heapq.heapify(heap)
    while heap:
        # get and yield lowest current value (and corresponding iterator)
        current_value, iterator = heap[0]
        yield current_value
        for current_value in iterator:
            # subseq is not finished, therefore add this subseq's pair
            # (current-value, iterator) back into the priority queue
            heapq.heapreplace(heap, (current_value, iterator))
            break
        else:
            # subseq has been exhausted, therefore remove it from the queue
            heapq.heappop(heap)
The need for “merging” sorted subsequences into a larger sorted sequence is reasonably frequent. If the amount of data is small enough to fit entirely in memory without problems, then the best approach is to build a list by concatenating all subsequences, then sort the list:
def smallmerge(*subsequences):
    result = []
    for subseq in subsequences:
        result.extend(subseq)
    result.sort()
    return result
The sort
method of list
objects is based on a sophisticated natural merge
algorithm, able to take advantage of existing sorted subsequences in
the list you’re sorting; therefore, this approach is quite fast, as
well as simple (and general, since this approach’s correctness does
not depend on all subsequences being already
sorted). If you can choose this approach, it has many other
advantages. For example, smallmerge
works fine even
if one of the subsequences
isn’t perfectly sorted to
start with; and in Python 2.4, you may add a generic keywords argument
**kwds
to
smallmerge
and pass it right along to the result.sort( )
step, to achieve the
flexibility afforded in that version by the cmp=
, key=
, and reverse=
arguments to list’s sort
method.
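A minimal sketch of that Python 2.4 variant (hypothetical name smallmerge24; the keyword arguments are simply forwarded to sort) might be:

def smallmerge24(*subsequences, **kwds):
    result = []
    for subseq in subsequences:
        result.extend(subseq)
    result.sort(**kwds)        # e.g. key=..., reverse=True in Python 2.4
    return result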
However, you sometimes deal with large sequences, which might not comfortably fit in memory all at the same time (e.g., your sequences might come from files on disk, or be computed on the fly, item by item, by other generators). When this happens, this recipe’s generator will enable you to perform your sequence merging while consuming a very moderate amount of extra memory (dependent only on the number of subsequences, not on the number of items in the subsequences).
The recipe’s implementation uses a classic sequence-merging
algorithm based on a priority queue, which, in turn, lets it take
advantage of the useful heapq
module in the Python Standard Library. heapq
offers functions to implement a
priority queue through the data structure known as a
heap.
A heap is any list
H
such that, for any valid index 0<=i<len(H)
, H[i]<=H[2*i+1]
, and H[i]<=H[2*i+2]
(if 2*i+1
and 2*i+2
are also valid indices into
H
). This heap property
is fast to establish on an arbitrary list (function heapify
does that) and very fast to
re-establish after altering or removing the smallest item (and
functions heapreplace
and heappop
do that). The smallest item is
always H[0] (it’s easy to see that the “heap” property implies this), and being
able to find the smallest item instantly makes heaps an excellent
implementation of priority queues.
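A quick interactive sketch of these heapq primitives (with arbitrarily chosen values) may help fix the idea:

import heapq
H = [7, 3, 9, 1]
heapq.heapify(H)                 # H now satisfies the heap property; H[0] == 1
print heapq.heappop(H)           # emits 1; the heap property is re-established
print heapq.heapreplace(H, 8)    # emits 3 (the smallest) and pushes 8 in one step
print H[0]                       # emits 7, the new smallest item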
In this recipe, we use as items in the “heap” a “pair” (i.e., two-items tuple) for each subsequence that is not yet exhausted (i.e., each subsequence through which we have not yet fully iterated). As its first item, each pair has the “current item” in the corresponding subsequence and, as its second item, an iterator over that subsequence. At each iteration step, we yield the smallest “current item”, then we advance the corresponding iterator and re-establish the “heap” property; when an iterator is exhausted, we remove the corresponding pair from the “heap” (so that, clearly, we’re finished when the “heap” is emptied). Note the idiom that we use to advance an iterator by one step, dealing with the possibility that the iterator is exhausted:
for current_value in iterator:
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    ...use pair (current_value, iterator)...
    # we break at once as we only wanted the first item of iterator
    break
else:
    # if we get here the break did not execute, so the iterator
    # was empty (exhausted)
    ...deal with the case of iterator being exhausted...
We use this idiom twice in the recipe, although in the first of
the two uses we do not need the else
clause since we can simply ignore
iterators that are immediately exhausted (they correspond to empty
subsequences, which can be ignored for merging purposes).
If you find this idiom confusing or tricky (because it uses a
for
statement whose body
immediately break
s—i.e., a
statement that looks like a loop but is not really a loop because it
never executes more than once!), you may prefer a different
approach:
try:
    current_value = iterator.next()
except StopIteration:
    # if we get here the iterator was empty (exhausted)
    ...deal with the case of iterator being exhausted...
else:
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    ...use pair (current_value, iterator)...
I slightly prefer the idiom using for
; in my view, it gains in clarity by
putting the normal case (i.e., an unexhausted iterator) first and the
rare case (an exhausted iterator) later. A variant of the try
/except
idiom that has the same property
is:
try:
    current_value = iterator.next()
    # if we get here the iterator was not empty, current_value was
    # its first value, and the iterator has been advanced one step
    ...use pair (current_value, iterator)...
except StopIteration:
    # if we get here the iterator was empty (exhausted)
    ...deal with the case of iterator being exhausted...
However, I somewhat dislike this variant (even though it’s quite
OK for the two specific uses of this recipe) because it crucially
depends on the code indicated as "use
pair
"
never raising a StopIteration
exception. As a general
principle, it’s best to use a try
clause’s body that is as small as possible—just the smallest fragment
that you do expect to possibly raise the
exception you’re catching in the following handlers (except
clauses), not
the follow-on code that must execute only if the exception was not
raised. The follow-on code goes in the else
clause of the try
statement, in properly defensive
Pythonic coding style. In any case, as long as you are fully aware of
the tradeoffs in clarity and defensiveness between these three roughly
equivalent idioms, you’re welcome to develop your own distinctive
Pythonic style and, in particular, to choose freely among them!
If you do choose either of the idioms that explicitly call
iterator.next( )
, a further
“refinement” (i.e., a tiny optimization) is to keep as the second item
of each pair, rather than the iterator
object, the bound-method iterator.next
directly, ready for you to
call. This optimization is not really tricky at all (it
is quite common in Python to stash away bound
methods and other such callables), but it may nevertheless result in
code of somewhat lower readability. Once again, the choice is up to
you!
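For concreteness, here is a sketch of that variant (my adaptation, not the recipe's code), keeping pairs of (current value, bound next method) in the heap:

import heapq

def merge_with_bound_methods(*subsequences):
    # same algorithm as merge, but each heap item is (current_value, it.next)
    heap = []
    for subseq in subsequences:
        it = iter(subseq)
        try:
            heap.append((it.next(), it.next))
        except StopIteration:
            pass                        # ignore empty subsequences
    heapq.heapify(heap)
    while heap:
        current_value, next = heap[0]
        yield current_value
        try:
            heapq.heapreplace(heap, (next(), next))
        except StopIteration:
            heapq.heappop(heap)         # this subsequence is exhausted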
Chapter 5 for general
issues about sorting and Recipe 5.7 and Recipe 5.8 about heapq
specifically; Library
Reference and Python in a Nutshell
documentation on module heapq
and
lists’ sort
method; Robert
Sedgewick, Algorithms (Addison-Wesley) (heaps
are covered starting on p. 178 in the 2d edition); heapq.py in the Python sources contains an
interesting discussion of heaps.
Credit: Ulrich Hoffmann, Guy Argo, Danny Yoo, Carl Bray, Doug Zongker, Gagan Saksena, Robin Houston, Michael Davies
You need to iterate on the permutations, combinations, or selections of a sequence. The fundamental rules of combinatorial arithmetic indicate that these derived sequences are very long even when the starting sequence is of moderate size: for example, there are over 6 billion permutations of a sequence of length 13. So you definitely do not want to compute (and keep in memory) all items in a derived sequence before you start iterating.
Generators enable you to compute needed objects one by one as you iterate on them. The loop inevitably takes a long time if there are vast numbers of such objects and you really need to examine each one. But at least you do not waste memory storing all of them at once:
def _combinators(_handle, items, n):
    ''' factored-out common structure of all following combinators '''
    if n == 0:
        yield []
        return
    for i, item in enumerate(items):
        this_one = [item]
        for cc in _combinators(_handle, _handle(items, i), n-1):
            yield this_one + cc

def combinations(items, n):
    ''' take n distinct items, order matters '''
    def skipIthItem(items, i):
        return items[:i] + items[i+1:]
    return _combinators(skipIthItem, items, n)

def uniqueCombinations(items, n):
    ''' take n distinct items, order is irrelevant '''
    def afterIthItem(items, i):
        return items[i+1:]
    return _combinators(afterIthItem, items, n)

def selections(items, n):
    ''' take n (not necessarily distinct) items, order matters '''
    def keepAllItems(items, i):
        return items
    return _combinators(keepAllItems, items, n)

def permutations(items):
    ''' take all items, order matters '''
    return combinations(items, len(items))

if __name__ == "__main__":
    print "Permutations of 'bar'"
    print map(''.join, permutations('bar'))
    # emits ['bar', 'bra', 'abr', 'arb', 'rba', 'rab']
print "Combinations of 2 letters from 'bar'" print map(''.join, combinations('bar', 2)) # emits['ba', 'br', 'ab', 'ar', 'rb', 'ra']
print "Unique Combinations of 2 letters from 'bar'" print map(''.join, uniqueCombinations('bar', 2)) # emits['ba', 'br', 'ar']
print "Selections of 2 letters from 'bar'" print map(''.join, selections('bar', 2)) # emits['bb', 'ba', 'br', 'ab', 'aa', 'ar', 'rb', 'ra', 'rr']
The generators in this recipe accept any sequence as the
items
argument and always yield lists of length
n
, where n
is
the second argument to the generator (permutations
accepts only one argument, and n
is by
definition equal to len(items)
).
You can modify the recipe so the generators yield tuples (or
instances of another sequence type), instead of lists, by changing two
lines of code in _combinators
. The yield
[ ]
must become yield
( )
(more generally, this statement must
yield the empty sequence of any sequence type you wish to use), and
name this_one
must be bound to the Singleton sequence
of any sequence type you wish to use. For example, to yield tuples,
change the statement that assigns to name this_one
into:
this_one = items[i],
(A subtle, often-forgotten point of Python syntax is that the comma identifies the right side of the assignment as a tuple. Placing parentheses around the right-hand side would be both insufficient and superfluous.)
Another way to modify this recipe is to have the generators
yield sequences of the same type as argument items
.
(As long as this type is indeed a sequence: specifically, it must
support slicing, as well as the use of the plus sign, +
, for concatenation). If that is what you
want, change the yield
of the empty
sequence into:
yield items[:0]
and change the assignment to name this_one
into:
this_one = items[i:i+1]
The definition of distinct items for this recipe’s purposes is: “items that occur at different indices in the input sequence.” If your input sequence has duplicates (i.e., the same item occurring at multiple indices), none of the functions in this recipe will care about removing them: rather, all functions will treat the duplicates as “distinct items” for all purposes.
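For example (a quick illustration of this point), an input with a repeated letter produces repeated results, since the two occurrences count as distinct items:

print map(''.join, uniqueCombinations('aab', 2))
# emits ['aa', 'ab', 'ab']: the two 'a's are treated as distinct items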
Recipe 19.16 for another combinatorics building block; Recipe 18.1 and Recipe 18.2.
Credit: David Eppstein, Jan Van lent, George Yoshida
You want to generate all partitions of a
given positive integer, that is, all the ways in which that integer
can be represented as a sum of positive integers (for example, the
partitions of 4
are 1+1+1+1
, 1+1+2
, 2+2
, 1+3
,
and 4
).
A recursive generator offers the simplest approach for this task, as is often the case with combinatorial computations:
def partitions(n):
    # base case of the recursion: zero is the sum of the empty tuple
    if n == 0:
        yield ()
        return
    # modify the partitions of n-1 to form the partitions of n
    for p in partitions(n-1):
        yield (1,) + p
        if p and (len(p) < 2 or p[1] > p[0]):
            yield (p[0] + 1,) + p[1:]
Partitions, like permutations, combinations and selections, are among the most basic primitives of combinatorial arithmetic. In other words, such constructs, besides being useful on their own, are building blocks for generating other kinds of combinatorial objects.
This recipe works along classic recursive lines. If you have a
partition of a positive integer n
, you can
reduce it to a partition of n-1
in a
canonical way by subtracting one from the smallest item in the
partition. For example, you can build partitions of 5 from partitions
of 6 by such transformation steps as 1+2+3
=> 2+3
, 2+4 => 1+4
,
and so forth. The algorithm in this recipe reverses the process: for
each partition p
of
n-1
, the algorithm finds the partitions of
n
that would be reduced to
p
by this canonical transformation step.
Therefore, each partition p
of
n
is output exactly once, at the step when
we are considering the partition p1
of
n-1
to which p
canonically reduces.
Be warned: the number of partitions of
n
grows fast when
n
itself grows. Ramanujan’s upper bound for
the number of partitions of a positive integer
k
is:
int(exp(pi*sqrt(2.0*k/3.0))/(4.0*k*sqrt(3.0)))
(where names exp
, pi
and sqrt
are all taken from module math
, in Python terms). For example, the
number 200 has about 4,100 billion partitions.
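Wrapped as a tiny helper function, the bound is easy to evaluate:

from math import exp, pi, sqrt
def max_partitions(k):
    # Ramanujan's upper bound on the number of partitions of k
    return int(exp(pi*sqrt(2.0*k/3.0)) / (4.0*k*sqrt(3.0)))
print max_partitions(200)    # roughly 4.1e12, i.e., about 4,100 billion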
This recipe generates each partition as a tuple of integers in
ascending order. If it’s handier for your application to deal with
partitions as tuples of integers in descending
order, you need only change the body of the for
loop in the recipe to:
yield p + (1,)
if p and (len(p) < 2 or p[-2] > p[-1]):
    yield p[:-1] + (p[-1] + 1,)
Creating a new tuple per item in the output stream, as this
recipe does, may result in performance issues, if you’re dealing with
a very large n
. One way to optimize this
aspect would be to return lists instead of tuples, and specifically to
return the same list object at each step (with
the descending-order modification, and append
and pop
operations rather than list
concatenation):
def partfast(n):
    # base case of the recursion: zero is the sum of the empty list
    if n == 0:
        yield []
        return
    # modify the partitions of n-1 to form the partitions of n
    for p in partfast(n-1):
        p.append(1)
        yield p
        p.pop()
        if p and (len(p) < 2 or p[-2] > p[-1]):
            p[-1] += 1
            yield p
This optimization is not worth the bother—not so much because of
the modest extra complication in partfast
’s own code,
but mostly because yielding the same list object at each step means
that code using partfast
must take precautions. For example,
list(partfast(4))
is a potentially
surprising list of five empty sublists, while list(partitions(4))
is exactly the expected
list of the five partitions of the number 4
.
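A quick comparison at the interactive prompt (assuming both generators are defined as above) shows the difference:

print list(partitions(4))
# [(1, 1, 1, 1), (1, 1, 2), (2, 2), (1, 3), (4,)]
print list(partfast(4))
# five references to one and the same (by now empty) list -- not the five partitions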
On the “other” hand, a different approach using an auxiliary parameter can actually produce a simplification for the descending-order case:
def partitions_descending(num, lt=None):
    # the auxiliary parameter lt bounds the largest summand; by default, num itself
    if lt is None:
        lt = num
    if not num:
        yield ()
    for i in xrange(min(num, lt), 0, -1):
        for parts in partitions_descending(num-i, i):
            yield (i,) + parts
This code is simpler than the variant given in the recipe and could be made even clearer in Python 2.4 by changing its outer loop into:
for i in reversed(xrange(1, min(num, lt)+1)):
Recipe 19.15 for more combinatorics building blocks.
Credit: Heiko Wundram, Raymond Hettinger
You have an iterator (or other iterable) object
x
, and need to iterate twice over
x
’s sequence of values.
In Python 2.4, solving this problem is the job of function
tee
in the standard library module
itertools
:
import itertools
x1, x2 = itertools.tee(x)
# you can now iterate on x1 and x2 separately
In Python 2.3, you can code tee
yourself:
import itertools

def tee(iterable):
    def yield_with_cache(next, cache={}):
        pop = cache.pop
        for i in itertools.count():
            try:
                yield pop(i)
            except KeyError:
                cache[i] = next()
                yield cache[i]
    it = iter(iterable)
    return yield_with_cache(it.next), yield_with_cache(it.next)
The need to iterate repeatedly over the same sequence of values is a reasonably common one. If you know that the sequence comes from a list, or some other container that intrinsically lets you iterate over its items repeatedly, then you simply perform the iteration twice. However, sometimes your sequence may come from a generator, a sequential file (which might, e.g., wrap a stream of data coming from a network socket—data that you can read only once), or some other iterator that is not intrinsically re-iterable. Even then, in some cases, the best approach is the simplest one—first save the data into a list in memory, then repeatedly iterate over that list:
saved_x = list(x)
for item in saved_x:
    do_something(item)
for item in saved_x:
    do_something_else(item)
The simple approach of first saving all data from the iterator
into a list is not feasible for an infinite sequence
x
, and may not be optimal if
x
is very large and your separate
iterations over it never get far out-of-step from each other. In these
cases, the tee
function shown in this recipe can
help. For example, say that the items of x
are either numbers or operators (the latter being represented as
strings such as '+
', '*
', etc.). Whenever you encounter an
operator, you must output the result of applying that operator to all
numbers immediately preceding it (since the last operator). Using
tee
, you could code:
def is_operator(item):
    return isinstance(item, str)

def operate(x):
    x1, x2 = tee(iter(x))
    while True:
        for item in x1:
            if is_operator(item):
                break
        else:
            # we get here when there are no more operators in the input
            # stream, thus the operate function is entirely done
            return
        if item == '+':
            total = 0
            for item in x2:
                if is_operator(item):
                    break
                total += item
            yield total
        elif item == '*':
            total = 1
            for item in x2:
                if is_operator(item):
                    break
                total *= item
            yield total
This kind of “look-ahead” usage is pretty typical of many of the
common use cases of tee
. Even in
this case, you might choose the alternative approach of accumulating
items in a list:
def operate_with_auxiliary_list(x):
    aux = []
    for item in x:
        if is_operator(item):
            if item == '+':
                yield sum(aux)
            elif item == '*':
                total = 1
                for item in aux:
                    total *= item
                yield total
            aux = []
        else:
            aux.append(item)
Having tee
available lets you
freely choose between these different styles of look-ahead
processing.
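Either version can be exercised the same way; for instance (an illustrative run, assuming the definitions above):

data = [1, 2, '+', 3, 4, '*']
print list(operate(data))                      # emits [3, 12]
print list(operate_with_auxiliary_list(data))  # emits [3, 12]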
Function itertools.tee
as
implemented in Python 2.4 is faster and more general than the pure
Python version given in this recipe for version 2.3 usage. However,
the pure Python version is quite instructive and deserves study for
the sake of the techniques it demonstrates, even if you’re lucky
enough to be using Python 2.4 and therefore don’t need to use this
pure Python version of tee
.
In the pure Python version of tee
, the nested
generator yield_with_cache
makes use of the fact
(which some consider a “wart” in Python but is actually quite useful)
that the default values of arguments get computed just once, at the
time the def
statement executes.
Thus, both calls to the nested generator in the return
statement of tee
implicitly share the same initially empty dict
as the value of the
cache
argument.
itertools.count
returns
non-negative integers, 0
and up,
one at a time. yield_with_cache
uses each of these
integers as a key into the cache
dictionary. The call
to pop(i)
(the argument of the
yield
statement in the try
clause) simultaneously returns and
removes the entry corresponding to key i
,
if that entry was present—that is, in this case,
if the “other” instance of the generator had already reached that
point in the iteration (and cached the item for our benefit).
Otherwise, the except
clause
executes, computes the item (by calling the object bound to name
next
, which in this case is the next
bound method of an iterator over the
iterable object, which tee
is
duplicating), and caches the item (for the “other” instance’s future
benefit) before yield
ing it.
So, in practice, cache
is being used as a FIFO
queue. Indeed, were it not for the fact that we don’t need a
pure-Python tee
in Python 2.4, we could code an
equivalent implementation of it in Python 2.4 using the new type
deque
in standard library module
collections
:
import collections

def tee_just_an_example(iterable):
    it = iter(iterable)
    # one deque per returned generator: each generator consumes items that
    # the *other* generator has already fetched ahead and cached for it
    cache1, cache2 = collections.deque(), collections.deque()
    def yield_with_cache(it, cache, other_cache):
        while True:
            if cache:
                yield cache.popleft()
            else:
                result = it.next()
                other_cache.append(result)
                yield result
    return (yield_with_cache(it, cache1, cache2),
            yield_with_cache(it, cache2, cache1))
This latest version is meant purely as an illustrative example, and therefore, it’s simplified by not using any of the bound-method extraction idioms shown in the version in the “Solution” (which is intended for “production” use in Python 2.3).
Once you’ve called tee
on an
iterator, you should no longer use the original iterator anywhere
else; otherwise, the iterator could advance without the knowledge of
the tee
-generated objects, and
those objects would then “get out of sync” with the original. Be
warned that tee
requires auxiliary storage that is
proportional to how much the two tee
-generated
objects get “apart” from each other in their separate iterations. In
general, if one iterator is going to walk over most or all of the data
from the original before the “other” one starts advancing, you should
consider using list
instead of
tee
. Both of these caveats apply to the itertools.tee
function of Python 2.4 just as
well as they apply to the pure Python versions of tee
presented in this recipe. One more caveat: again both for the versions
in this recipe, and the itertools.tee
function in Python 2.4, there
is no guarantee of thread safety: to access the tee’d iterators from
different threads, you need to guard those iterators with a single
lock!
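If you do need to share the tee'd iterators across threads, one possible guard (a sketch of my own, not part of the recipe) is a tiny wrapper that serializes every next call through one shared lock:

import threading, itertools

class LockedIterator(object):
    ''' serialize all next() calls on an iterator through one shared lock '''
    def __init__(self, it, lock):
        self._it = it
        self._lock = lock
    def __iter__(self):
        return self
    def next(self):
        self._lock.acquire()
        try:
            return self._it.next()
        finally:
            self._lock.release()

# one single lock must guard both tee'd iterators:
# lock = threading.Lock()
# x1, x2 = [LockedIterator(t, lock) for t in itertools.tee(x)]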
The itertools
module is part
of the Python Standard Library and is documented in the
Library Reference portion of Python’s online
documentation; Recipe
19.2 shows how to turn an iterator into a list.
Credit: Steven Bethard, Peter Otten
You are using an iterator for some task such as parsing, which requires you to be able to “look ahead” at the next item the iterator is going to yield, without disturbing the iterator state.
The best solution is to wrap your original iterator into a suitable class, such as the following one (Python 2.4-only):
import collections

class peekable(object):
    """ An iterator that supports a peek operation.  Example usage:
    >>> p = peekable(range(4))
    >>> p.peek()
    0
    >>> p.next(1)
    [0]
    >>> p.peek(3)
    [1, 2, 3]
    >>> p.next(2)
    [1, 2]
    >>> p.peek(2)
    Traceback (most recent call last):
      ...
    StopIteration
    >>> p.peek(1)
    [3]
    >>> p.next(2)
    Traceback (most recent call last):
      ...
    StopIteration
    >>> p.next()
    3
    """
    def __init__(self, iterable):
        self._iterable = iter(iterable)
        self._cache = collections.deque()
    def __iter__(self):
        return self
    def _fillcache(self, n):
        if n is None:
            n = 1
        while len(self._cache) < n:
            self._cache.append(self._iterable.next())
    def next(self, n=None):
        self._fillcache(n)
        if n is None:
            result = self._cache.popleft()
        else:
            result = [self._cache.popleft() for i in range(n)]
        return result
    def peek(self, n=None):
        self._fillcache(n)
        if n is None:
            result = self._cache[0]
        else:
            result = [self._cache[i] for i in range(n)]
        return result
Many iterator-related tasks, such as parsing, require the
ability to “peek ahead” (once or a few times) into the sequence of
items that an iterator is yielding, in a way that does not alter the
iterator’s observable state. One approach is to use the new Python 2.4
function itertools.tee
to get two
independent copies of the iterator, one to be advanced for peeking
purposes and the “other” one to be used as the “main” iterator. It’s
actually handier to wrap the incoming iterator once for all, at the
start, with the class peekable
presented in this
recipe; afterwards, a peek
method, which is safe and
effective, can be counted on. A little added sweetener is the ability
to call peek
(and, as long as we’re at it, the
standard next
method too) with a
specific number argument n
, to request a
list of the next n
items of the iterator
(without disturbing the iterator’s state when you call peek(n)
, with iterator state advancement
when you call next(
n
)
—just
like for normal calls without arguments to the same methods).
The obvious idea used in this recipe for implementing
peekable
is to have it keep a cache of peeked-ahead
arguments. Since the cache must grow at the tail and get consumed from
the end, a natural choice is to make the cache a collections.deque
, a new type introduced in
Python 2.4. However, if you need this code to run under version 2.3 as
well, make self._cache
a list
instead—you only need to change method
next
’s body a little bit, making
it:
if n is None:
    result = self._cache.pop(0)
else:
    result, self._cache = self._cache[:n], self._cache[n:]
As long as you’re caching only one or just a few items of
lookahead at a time, performance won’t suffer much by making
self._cache
a list
rather than a deque
.
An interesting characteristic of the peekable
class presented in this recipe is that, if you request too many items
from the iterator, you get a StopIteration
exception but that does not
throw away the last few values of the iterator. For example, if
p
is an instance of
peekable
with just three items left, when you call
p.next(5)
, you get a StopIteration
exception. You can later call
p.next(3)
and get the list of the
last three items.
A subtle point is that the n
argument
to methods peek
and next
defaults to None
, not to 1
. This gives you two distinct ways to peek
at a single item: the default way, calling p.peek( )
, just gives you that item, while
calling p.peek(1)
gives you a list
with that single item in it. This behavior is quite consistent with
the way p.peek
behaves when called
with different arguments: any call p.peek(n)
with any non-negative integer
n
returns a list with
n
items (or raises StopIteration
if
p
has fewer than
n
items left). This approach even supports
calls such as p.next(0)
, which in
practice always returns an empty list [
]
without advancing the iterator’s state. Typically, you
just call p.peek( )
, without
arguments, and get one look-ahead item without problems.
As an implementation detail, note that the docstring of the
class peekable
presented in this recipe is
essentially made up of examples of use with expected results. Besides
being faster to write, and arguably to read for an experienced
Pythonista, this style of docstring is perfect for use with the Python
Standard Library module doctest
.
collections.deque
and
doctest
in the
Python Library Reference
(for Python 2.4).
Credit: Jimmy Retzlaff, Paul Moore
You want to code a consumer thread which gets work requests off a queue one at a time, processes each work request, and eventually stops, and you want to code it in the simplest possible way.
This task is an excellent use case for the good old Sentinel idiom. The producer thread, when it’s done putting actual work requests on the queue, must finally put a sentinel value, that is, a value that is different from any possible work request. Schematically, the producer thread will do something like:
for input_item in stuff_needing_work:
    work_request = make_work_request(input_item)
    queue.put(work_request)
queue.put(sentinel)
where sentinel
must be a “well-known
value”, different from any work_request
object that
might be put on the queue in the first phase.
The consumer thread can then exploit the built-in function
iter
:
for work_request in iter(queue.get, sentinel):
    process_work_request(work_request)
cleanup_and_terminate()
Were it not for built-in function iter
, the consumer thread would have to use
a slightly less simple and elegant structure, such as:
while True:
    work_request = queue.get()
    if work_request == sentinel:
        break
    process_work_request(work_request)
cleanup_and_terminate()
However, the Sentinel idiom is so useful
and important that Python directly supports it with built-in function
iter
. When you call iter
with just one argument, that argument
must be an iterable object, and iter
returns an iterator for it. But when
you call iter
with two arguments,
the first one must be a callable which can be called without
arguments, and the second one is an arbitrary value known as the
sentinel. In the two-argument case, iter
repeatedly calls the first argument. As
long as each call returns a value !=sentinel
, that value becomes an item in
the iteration; as soon as a call returns a value ==sentinel
, the iteration stops.
If you had to code this yourself as a generator, you could write:
def iter_sentinel(a_callable, the_sentinel):
    while True:
        item = a_callable()
        if item == the_sentinel:
            break
        yield item
But the point of this recipe is that you don’t have to code even
this simple generator: just use the power that Python gives you as
part of the functionality of the built-in function iter
!
Incidentally, Python offers many ways to make sentinel values—meaning values that compare equal only to themselves. The simplest and most direct way, and therefore the one I suggest you always use for this specific purpose, is:
sentinel = object( )
Documentation for iter
in the
Library Reference and Python in a
Nutshell.
Credit: Garth Kidd
You want to run the code of a generator (or any other iterator) in its own separate thread, so that the iterator’s code won’t block your main thread even if it contains time-consuming operations, such as blocking calls to the operating system.
This task is best tackled by wrapping a subclass of threading.Thread
around the iterator:
import sys, threading, Queue

class SpawnedGenerator(threading.Thread):
    def __init__(self, iterable, queueSize=0):
        threading.Thread.__init__(self)
        self.iterable = iterable
        self.queueSize = queueSize
    def stop(self):
        "Ask iteration to stop as soon as feasible"
        self.stopRequested = True
    def run(self):
        "Thread.start runs this code in another, new thread"
        put = self.queue.put
        try:
            next = iter(self.iterable).next
            while True:
                # report each result, propagate StopIteration
                put((False, next()))
                if self.stopRequested:
                    raise StopIteration
        except:
            # report any exception back to main thread and finish
            put((True, sys.exc_info()))
    def execute(self):
        "Yield the results that the 'other', new thread is obtaining"
        self.queue = Queue.Queue(self.queueSize)
        get = self.queue.get
        self.stopRequested = False
        self.start()              # executes self.run() in another, new thread
        while True:
            iterationDone, item = get()
            if iterationDone:
                break
            yield item
        # propagate any exception (unless it's just a StopIteration)
        exc_type, exc_value, traceback = item
        if not issubclass(exc_type, StopIteration):
            raise exc_type, exc_value, traceback
    def __iter__(self):
        "Return an iterator for our executed self"
        return iter(self.execute())
Generators (and other iterators) are a great way to package the
logic that controls an iteration and obtains the next value to feed
into a loop’s body. The code of a generator (and, equivalently, the
code of the next
method of another
kind of iterator) usually runs in the same thread as the code that’s
iterating on it. The “calling” code can therefore
block, each time around the loop, while waiting
for the generator’s code to do its job.
Sometimes, you want to use a generator (or other kind of
iterator) in a “non-blocking” way, which means you need to arrange
things so that the generator’s body runs in a new, separate thread.
This recipe shows a class which supplies exactly this kind of
functionality: this recipe’s SpawnedGenerator
class
subclasses threading.Thread
and
uses Thread
’s start
/run
mechanism to ensure the generator’s body always executes in a separate
thread from that of the calling code.
All communication between the two threads occurs through a
single instance of the Queue.Queue
class (accessed through bound methods stashed in local variables in each of the two communicating methods: the generator named execute
that runs in the
calling thread and the method named run
that runs in
a separate thread). The “calling” code may also call method
stop
on the SpawnedGenerator
instance to ask for the iteration to stop as soon as feasible.
Optionally, you may also specify a queue size when you instantiate
SpawnedGenerator
, if you want to limit how far ahead
of the calling thread the spawned thread can get.
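For instance, a minimal usage sketch (with a made-up slow generator standing in for an iterator that blocks on the operating system) might look like this:

import time

def slow_numbers(n):
    for i in xrange(n):
        time.sleep(0.5)       # stand-in for a blocking operating-system call
        yield i

spawned = SpawnedGenerator(slow_numbers(5), queueSize=2)
for value in spawned:
    print value               # the sleeping happens in the spawned thread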
The main use case for this recipe is for wrapping iterators that
make blocking calls to the operating system (e.g., walking a directory
tree), when you need to use such iterators in an application where the
“main” thread cannot be allowed to block for a long time. The typical
examples of applications whose main thread must not block are
event-driven applications, a description that applies to applications
with a GUI, as well as to networking applications built on
asynchronous frameworks, such as Twisted or the asyncore
module of the Python Standard
Library.
Library Reference and Python
in a Nutshell docs about modules threading
and asyncore
; Twisted is at http://www.twistedmatrix.com/;
Chapter 9 for general issues
about threading; Chapter 11
for general issues about user interfaces; Chapter 13 and Chapter 14 for general issues
about network and web programming, including asynchronous approaches
to such programs.
Credit: Paul Moore, Raymond Hettinger
You have a list of data grouped by a key value, typically read from a spreadsheet or the like, and want to generate a summary of that information for reporting purposes.
The itertools.groupby
function introduced in Python 2.4 helps with this task:
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), field=itemgetter(1)):
    """ Summarise the given data (a sequence of rows), grouped by the
        given key (default: the first item of each row), giving totals
        of the given field (default: the second item of each row).
        The key and field arguments should be functions which, given a
        data record, return the relevant value.
    """
    for k, group in groupby(data, key):
        yield k, sum(field(row) for row in group)

if __name__ == "__main__":
    # Example: given a sequence of sales data for city within region,
    # _sorted on region_, produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000),
             ('Scotland', 'Glasgow', 12500),
             ('Wales', 'Cardiff', 29700),
             ('Wales', 'Bangor', 12800),
             ('England', 'London', 90000),
             ('England', 'Manchester', 45600),
             ('England', 'Liverpool', 29700)]
    for region, total in summary(sales, field=itemgetter(2)):
        print "%10s: %d" % (region, total)
In many situations, data is available in tabular form, with the
information naturally grouped by a subset of the data values (e.g.,
recordsets obtained from database queries and data read from
spreadsheets—typically with the csv
module of the Python Standard Library). It is often useful to be able
to produce summaries of the detail data.
The new groupby
function
(added in Python 2.4 to the itertools
module of the Python Standard
Library) is designed exactly for the purpose of handling such grouped
data. It takes as arguments an iterator, whose items are to be thought
of as records, along with a function to extract the
key value from each record. itertools.groupby
yields each distinct key
from the iterator in turn, each along with a new iterator that runs
through the data values associated with that key.
The groupby
function is often
used to generate summary totals for a dataset. The
summary
function defined in this recipe shows one
simple way of doing this. For a summary report, two extraction
functions are required: one function to extract the key, which is the
function that you pass to the groupby
function, and another function to
extract the values to be summarized. The recipe uses another
innovation of Python 2.4 for these purposes: the operator.itemgetter
higher-order function:
called with an index i
as its argument,
itemgetter
produces a function
f
such that f(
x
)
extracts the i
th item from x
,
operating just like the indexing x[i].
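For instance (a quick interactive check):

from operator import itemgetter
get_city = itemgetter(1)
print get_city(('Wales', 'Cardiff', 29700))    # emits: Cardiff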
The input records must be sorted by the given key; if you’re
uncertain about that condition, you can use groupby(sorted(data, key=key), key)
to
ensure it, exploiting the built-in function sorted
, also new in Python 2.4. It’s quite
convenient that the same key-extraction function can be passed to both
sorted
and groupby
in this idiom. The groupby
function itself does not sort its
input, which gains extra flexibility that may come in handy—although
most of the time you will want to use groupby
only on sorted data. See Recipe 19.10 for a case in
which it’s quite handy to use groupby
on nonsorted
data.
For example, if the sales data was in a CSV file sales.csv, the usage example in the
recipe’s if __name__ == '__main__' section might become:
import csv
sales = sorted(csv.reader(open('sales.csv', 'rb')), key=itemgetter(0))
# csv.reader yields rows of strings, so convert the sales figure to a number
for region, total in summary(sales, field=lambda row: int(row[2])):
    print "%10s: %d" % (region, total)
Overall, this recipe provides a vivid illustration of how the
new Python 2.4 features work well together: in addition to the
groupby
function, the operator.itemgetter
used to provide field
extraction functions, and the potential use of the built-in function
sorted
, the recipe also uses a
generator expression as the argument to the sum
built-in function. If you need to
implement this recipe’s functionality in Python 2.3, you can start by
implementing your own approximate version of groupby
, for example as follows:
class groupby(dict):
    def __init__(self, seq, key):
        for value in seq:
            k = key(value)
            self.setdefault(k, []).append(value)
    __iter__ = dict.iteritems
This version doesn’t include all the features of Python 2.4’s
groupby
, but it’s very simple and
may be sufficient for your purposes. Similarly, you can write your own
simplified versions of functions itemgetter
and sorted
, such as:
def itemgetter(i):
    def getter(x):
        return x[i]
    return getter

def sorted(seq, key):
    aux = [(key(x), i, x) for i, x in enumerate(seq)]
    aux.sort()
    return [x for k, i, x in aux]
As for the generator expression, you can simply use a list
comprehension in its place—just call sum([field(row) for row in group])
where the
recipe has the same call without the additional square brackets,
[ ]
. Each of these substitutions
will cost a little performance, but, overall, you can build the same
functionality in Python 2.3 as you can in version 2.4—the latter is just slicker, simpler, faster, neater!
itertools.groupby
, operator.itemgetter
, sorted
, and csv
in the Library
Reference (for Python 2.4).