Credit: Tim Peters, PythonLabs
Computer manufacturers of the 1960s estimated that more than 25 percent of the running time on their computers was spent on sorting, when all their customers were taken into account. In fact, there were many installations in which the task of sorting was responsible for more than half of the computing time. From these statistics we may conclude that either (i) there are many important applications of sorting, or (ii) many people sort when they shouldn’t, or (iii) inefficient sorting algorithms have been in common use.
Donald Knuth
The Art of Computer Programming, vol. 3, Sorting and Searching, page 3
Professor Knuth’s masterful work on the topics of sorting and searching spans nearly 800 pages of sophisticated technical text. In Python practice, we reduce it to two imperatives (we read Knuth so you don’t have to): when you need to sort, find a way to use the built-in sort method of Python lists; when you need to search, find a way to use built-in dictionaries.
Many recipes in this chapter illustrate these principles.
The most common theme is using the decorate-sort-undecorate (DSU) pattern, a general approach to transforming a sorting problem by creating an auxiliary list that we can then sort with the default, speedy sort method. This technique is the single most useful one to take from this chapter. In fact, DSU is so useful that Python 2.4 introduced new features to make it easier to apply. Many recipes can be made simpler in 2.4 as a result, and the discussions of older recipes have been updated to show how.
DSU relies on an unusual feature of Python’s built-in comparisons: sequences are compared lexicographically. Lexicographical order is a generalization to tuples and lists of the everyday rules used to compare strings (e.g., alphabetical order). The built-in cmp(s1, s2), when s1 and s2 are sequences, is equivalent to this Python code:
def lexcmp(s1, s2):
    # Find leftmost nonequal pair.
    i = 0
    while i < len(s1) and i < len(s2):
        outcome = cmp(s1[i], s2[i])
        if outcome:
            return outcome
        i += 1
    # All equal, until at least one sequence was exhausted.
    return cmp(len(s1), len(s2))
This code looks for the first unequal corresponding elements. If such an unequal pair is found, that pair determines the outcome. Otherwise, if one sequence is a proper prefix of the other, the prefix is considered to be the smaller sequence. Finally, if these cases don’t apply, the sequences are identical and are considered equal. Here are some examples:
>>> cmp((1, 2, 3), (1, 2, 3))  # identical
0
>>> cmp((1, 2, 3), (1, 2))     # first larger because second is a prefix
1
>>> cmp((1, 100), (2, 1))      # first smaller because 1<2
-1
>>> cmp((1, 2), (1, 3))        # first smaller because 1==1, then 2<3
-1
An immediate consequence of lexicographical comparison is that if you want to sort a list of objects by a primary key, breaking ties by comparing a secondary key, you can simply build a list of tuples, in which each tuple contains the primary key, secondary key, and original object, in that order. Because tuples are compared lexicographically, this automatically does the right thing. When comparing tuples, the primary keys are compared first, and if (and only if) the primary keys are equal, the secondary keys are compared.
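As a minimal illustration of this idea (the records here are invented for the example), here is how one might sort (first name, last name) pairs by last name, breaking ties by first name:

people = [('John', 'Smith'), ('Jane', 'Doe'), ('Adam', 'Smith')]
# decorate: primary key, secondary key, then the original record
aux = [(p[1], p[0], p) for p in people]
aux.sort()
# undecorate: keep just the original records
print [t[-1] for t in aux]
# emits: [('Jane', 'Doe'), ('Adam', 'Smith'), ('John', 'Smith')]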
The examples of the DSU pattern in this chapter show many applications of this idea. The DSU technique applies to any number of keys. You can add to the tuples as many keys as you like, in the order in which you want the keys compared. In Python 2.4, you can get the same effect with the new key= optional argument to sort, as several recipes point out. Using the sort method’s key= argument is easier, more memory-efficient, and runs faster than building an auxiliary list of tuples by hand.
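In Python 2.4, for example, the two-key sort just sketched reduces to a single call, with the key= callable returning the decorating tuple (same invented records as above):

people.sort(key=lambda p: (p[1], p[0]))    # last name first, then first name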
The other 2.4-introduced innovation in sorting is a convenient shortcut: a sorted built-in function that sorts any iterable, not in-place, but by first copying it into a new list. In Python 2.3 (apart from the new optional keyword arguments, which apply to the sorted built-in function as well as to list.sort), you can code the same functionality quite easily:
def sorted_2_3(iterable):
    alist = list(iterable)
    alist.sort()
    return alist
Because copying a list and sorting it are both nontrivial operations, and the built-in sorted needs to perform those operations too, no speed advantage is gained in making sorted a built-in. Its advantage is just the convenience. Having something always around and available, rather than having to code even just four simple lines over and over, does make a difference in practice. On the other hand, few tiny functions are used commonly enough to justify expanding the set of built-ins. Python 2.4 added sorted and reversed because those two functions were requested very frequently over the years.
The biggest change in Python sorting since the first
edition of this book is that Python 2.3 moved to a new implementation of
sorting. The primary visible consequences are increased speed in many
common cases, and the fact that the new sort is stable (meaning that
when two elements compare equal in the original list, they retain their
relative order in the sorted list). The new implementation was so
successful, and the chances of improving on it appeared so slim, that
Guido was persuaded to proclaim that Python’s list.sort
method will always be stable. This
guarantee started with Python 2.4 but was actually realized in Python
2.3. Still, the history of sorting cautions us that better methods may
yet be discovered. A brief account of Python’s sorting history may be
instructive in this regard.
In early releases of Python, list.sort
used the qsort
routine from the underlying platform’s
C library. This didn’t work out for several reasons, primarily because
the quality of qsort
varied widely
across machines. Some versions were extremely slow when given a list
with many equal values or in reverse-sorted order. Some versions even
dumped core because they weren’t reentrant. A user-defined __cmp__
function can also invoke list.sort
, so that one list.sort
can invoke others as a side effect
of comparing. Some platform qsort
routines couldn’t handle that. A user-defined __cmp__
function can also (if it’s insane
or malicious) mutate the list while it’s being sorted, and many
platform qsort
routines dumped core
when that happened.
Python then grew its own implementation of the quicksort algorithm. This was rewritten with every release, as real-life cases of unacceptable slowness were discovered. Quicksort is a delicate algorithm indeed!
In Python 1.5.2 the quicksort algorithm was replaced by a hybrid of samplesort and binary insertion sort, and that implementation remained unchanged for more than four years, until Python 2.3. Samplesort can be viewed as a variant of quicksort that uses a very large sample size to pick the partitioning element, also known as the pivot (it recursively samplesorts a large random subset of the elements and picks the median of those). This variant makes quadratic-time behavior almost impossible and brings the number of comparisons in the average case much closer to the theoretical minimum.
However, because samplesort is a complicated algorithm, it has too much administrative overhead for small lists. Therefore, small lists (and small slices resulting from samplesort partitioning) were handled by a separate binary insertion sort, which is an ordinary insertion sort, except that it uses binary search to determine where each new element belongs. Most sorting texts say this isn’t worth the bother, but that’s because most texts assume that comparing two elements is as cheap as or cheaper than swapping them in memory, which isn’t true for Python’s sort! Moving an object is very cheap, since what is copied is just a reference to the object. Comparing two objects is expensive, though, because all of the object-oriented machinery for finding the appropriate code to compare two objects and for coercion gets reinvoked each time. This made binary search a major win for Python’s sort.
On top of this hybrid approach, a few common special cases were exploited for speed. First, already-sorted or reverse-sorted lists were detected and handled in linear time. For some applications, these kinds of lists are very common. Second, if an array was mostly sorted, with just a few out-of-place elements at the end, the binary insertion sort handled the whole job. This was much faster than letting samplesort have at it and occurred often in applications that repeatedly sort a list, append a few new elements, then sort it again. Finally, special code in the samplesort looked for stretches of equal elements, so that the slice they occupy could be marked as done early.
In the end, all of this yielded an in-place sort with excellent performance in all known real cases and supernaturally good performance in some common special cases. It spanned about 500 lines of complicated C code, which gives special poignancy to Recipe 5.11.
Over the years samplesort was in use, I made a standing offer to buy dinner for anyone who could code a faster Python sort. Alas, I ate alone. Still, I kept my eye on the literature because several aspects of the samplesort hybrid were irritating:
While no case of quadratic-time behavior appeared in real life, I knew such cases could be contrived, and it was easy to contrive cases two or three times slower than average ones.
The special cases to speed sorting in the presence of extreme partial order were valuable in practice, but my real data often had many other kinds of partial order that should be exploitable. In fact, I came to believe that random ordering in input lists almost never exists in real life (i.e., not outside of timing harnesses for testing sorting algorithms!).
There is no practical way to make samplesort stable without grossly increasing memory use.
The code was very complex and complicated in ugly ways by the special cases.
It was always clear that a mergesort would be better on
several counts, including guaranteed worst-case n log n
time, and that mergesort is easy to
make stable. The problem was that half a dozen attempts to code a
mergesort for Python yielded a sort that ran slower (mergesort does
much more data movement than samplesort) and consumed more
memory.
A large and growing literature concerns adaptive sorting algorithms, which attempt to detect order of various kinds in the input. I coded a dozen of them, but they were all much slower than Python’s samplesort except on the cases they were designed to exploit. The theoretical bases for these algorithms were simply too complex to yield effective practical algorithms. Then I read an article pointing out that list merging naturally reveals many kinds of partial order, simply by paying attention to how often each input list “wins” in a row. This information was simple and general. When I realized how it could be applied to a natural mergesort, which would obviously exploit all the kinds of partial order I knew and cared about, I got obsessed enough to solve the speed problem for random data and to minimize the memory burden.
The resulting “adaptive, natural, stable” mergesort implemented for Python 2.3 was a major success, but also a major engineering effort—the devil is in the details. There are about 1,200 lines of C code, but unlike the code in the samplesort hybrid, none of these lines are coding for special cases, and about half implement a technical trick allowing the worst-case memory burden to be cut in half. I’m quite proud of it, but the margins of this introduction lack the space for me to explain the details. If you’re curious, I wrote a long technical description that you can find in a Python source distribution: Objects/listsort.txt under the main directory (say, Python-2.3.5 or Python-2.4) where you unpacked Python’s source distribution archive. In the following list, I provide examples of the partial order Python 2.3’s mergesort naturally exploits, where “sorted” means in either forward-sorted or reverse-sorted order:
The input is already sorted.
The input is mostly sorted but has random elements appended at either end, or both, or inserted in the middle.
The input is the concatenation of two or more sorted lists.
In fact, the fastest way to merge multiple sorted lists in Python now is to join them into one long list and run list.sort on that (see the sketch just after this list).
The input is mostly sorted but has some scattered elements that are out of order. This is common, for example, when people manually add new records to a database sorted by name: people aren’t good at maintaining strict alphabetic order but are good at getting close.
The input has many keys with the same value. For example, when sorting a database of American companies by the stock exchange they’re listed on, most will be associated with the NYSE or NASDAQ exchanges. This is exploitable for a curious reason: records with equal keys are already in sorted order, by the definition of “stable”! The algorithm detects that naturally, without code especially looking for equal keys.
The input was in sorted order but got dropped on the floor in chunks; the chunks were reassembled in random order, and to fight boredom, some of the chunks were riffle-shuffled together. While that’s a silly example, it still results in exploitable partial order and suggests how general the method is.
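Here is a tiny sketch of that merging trick (list contents invented for the example); the concatenation consists of two sorted runs, exactly the kind of partial order just described:

a = [1, 4, 9, 16, 25]
b = [2, 3, 5, 8, 13]
merged = a + b    # concatenation of two sorted lists
merged.sort()     # the mergesort just merges the two runs
print merged      # emits: [1, 2, 3, 4, 5, 8, 9, 13, 16, 25]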
In short, Python 2.3’s timsort (well, it has to have some brief name) is stable, robust, and preternaturally fast in many real-life cases: use it any time you can!
Credit: Alex Martelli
You want to sort a dictionary. This probably means that you want to sort the keys and then get the values in that same sorted order.
The simplest approach is exactly the one expressed by the problem statement: sort the keys, then pick the corresponding values:
def sortedDictValues(adict):
    keys = adict.keys()
    keys.sort()
    return [adict[key] for key in keys]
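A quick interactive illustration (dictionary contents invented for the example):

>>> sortedDictValues({'b': 2, 'c': 3, 'a': 1})
[1, 2, 3]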
The concept of sorting applies only to a collection that has an order—in other words, a sequence. A mapping, such as a dictionary, has no order, so it cannot be sorted. And yet, “How do I sort a dictionary?” is a frequent, though literally meaningless, question on the Python lists. More often than not, the question is in fact about sorting some sequence composed of keys and/or values from the dictionary.
As for the implementation, while one could think of more sophisticated approaches, it turns out (not unusually, for Python) that the one shown in the solution, the simplest one, is also essentially the fastest one. A further slight increase in speed, about 20%, can be squeezed out in Python 2.3 by replacing the list comprehension with a map call in the return statement at the end of the function. For example:
return map(adict.get, keys)
Python 2.4, however, is already measurably faster than Python 2.3 with the version in the “Solution” and gains nothing from this further step. Other variants, such as using adict.__getitem__ instead of adict.get, offer no further increase in speed, or they even slow performance down a little, in both Python 2.3 and 2.4.
See also Recipe 5.4 for sorting a dictionary based on its values rather than on its keys.
Credit: Kevin Altis, Robin Thomas, Guido van Rossum, Martin V. Lewis, Dave Cross
You want to sort a list of strings, ignoring case differences. For example, you want 'a', although it’s lowercase, to sort before 'B', although the latter is uppercase. By default, however, string comparison is case-sensitive (e.g., all uppercase letters sort before all lowercase ones).
The decorate-sort-undecorate (DSU) idiom is simple and fast:
def case_insensitive_sort(string_list):
    auxiliary_list = [(x.lower(), x) for x in string_list]    # decorate
    auxiliary_list.sort()                                     # sort
    return [x[1] for x in auxiliary_list]                     # undecorate
In Python 2.4, DSU is natively supported, so (assuming the items of string_list are indeed strings, and not, e.g., Unicode objects), you can use the following even shorter and faster approach:
def case_insensitive_sort(string_list):
    return sorted(string_list, key=str.lower)
An obvious alternative to this recipe’s Solution is to code a comparison function and pass it to the sort method:
def case_insensitive_sort_1(string_list):
    def compare(a, b):
        return cmp(a.lower(), b.lower())
    string_list.sort(compare)
However, in this way the lower method gets called twice for every comparison, and the number of comparisons needed to sort a list of n items is typically proportional to n log(n).
The DSU idiom builds an auxiliary list, whose items are tuples where each item of the original list is preceded by a “key”. The sort then takes place on the key, because Python compares tuples lexicographically (i.e., it compares the tuples’ first items first). With DSU, the lower method gets called only n times to sort a list of n strings, which saves enough time to cover the small costs of the first, decorate step and the final, undecorate step, with a big net increase in speed.
DSU is also sometimes known, not quite correctly, as the Schwartzian Transform, by somewhat imprecise analogy with a well-known idiom of the Perl language. (If anything, DSU is closer to the Guttman-Rosler Transform, see http://www.sysarch.com/perl/sort_paper.html.)
DSU is so important that Python 2.4 supports it directly: you can optionally pass to the sort method of a list an argument named key, which is the callable to use on each item of the list to obtain the key for the sort. If you pass such an argument, the sorting internally uses DSU. So, in Python 2.4, string_list.sort(key=str.lower) is essentially equivalent to function case_insensitive_sort, except that the sort method sorts the list in-place (and returns None) instead of returning a sorted copy and leaving the original list alone. If you want function case_insensitive_sort to sort in-place, by the way, just change its return statement into an assignment to the list’s body:
string_list[:] = [x[1] for x in auxiliary_list]
Vice versa, if, in Python 2.4, you want to get a sorted copy and leave the original list alone, you can use the new built-in function sorted. For example, in Python 2.4:

for s in sorted(string_list, key=str.lower):
    print s

prints each string in the list, sorted case-insensitively, without affecting string_list itself.
The use of str.lower as the key argument in the Python 2.4 Solution restricts you to specifically sorting strings (not, e.g., Unicode objects). If you know you’re sorting a list of Unicode objects, use key=unicode.lower instead. If you need a function that applies just as well to strings and Unicode objects, you can import string and then use key=string.lower; alternatively, you could use key=lambda s: s.lower().
If you need case-insensitive sorting of lists of strings, you might also need dictionaries and sets using case-insensitive strings as keys, lists behaving case-insensitively regarding such methods as index and count, case-insensitive results from needle in haystack, and so on. If that is the case, then your real underlying need is a subtype of str that behaves case-insensitively in comparison and hashing—a clearly better factoring of the issue, compared to implementing many container types and functions to get all of this functionality. To see how to implement such a type, see Recipe 1.24.
See also the Python Frequently Asked Questions, http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.051.htp; Recipe 5.3; the Python 2.4 Library Reference about the sorted built-in function and the key argument to sort and sorted; Recipe 1.24.
Credit: Yakov Markovitch, Nick Perkins
You need to sort a list of objects according to one attribute of each object, as fast and portably as possible. The DSU idiom shines, as usual:
def sort_by_attr(seq, attr):
    intermed = [ (getattr(x, attr), i, x) for i, x in enumerate(seq) ]
    intermed.sort()
    return [ x[-1] for x in intermed ]

def sort_by_attr_inplace(lst, attr):
    lst[:] = sort_by_attr(lst, attr)
In Python 2.4, DSU is natively supported, so your code can be even shorter and faster:
import operator

def sort_by_attr(seq, attr):
    return sorted(seq, key=operator.attrgetter(attr))

def sort_by_attr_inplace(lst, attr):
    lst.sort(key=operator.attrgetter(attr))
Sorting a list of objects by an attribute of each object is best done using the DSU idiom introduced previously in Recipe 5.2. In Python 2.3 and 2.4, DSU is no longer needed, as it used to be, to ensure that a sort is stable (sorting is always stable in Python 2.3 and later), but DSU’s speed advantages still shine.
Sorting, in the general case and with the best algorithms, is O(n log n) (as is often the case in mathematical formulas, the juxtaposition of terms, here n and log n, indicates that the terms are multiplied). DSU’s speed comes from maximally accelerating the O(n log n) part, which dominates sorting time for sequences of substantial length n, by using only Python’s native (and maximally fast) comparison. The preliminary decoration step, which prepares an intermediate auxiliary list of tuples, and the successive undecoration step, which extracts the important item from each tuple after the intermediate list is sorted, are only O(n). Therefore any minor inefficiencies in these steps contribute negligible overhead if n is large enough, and reasonably little even for many practical values of n.
This recipe puts index i, in each tuple that is an item of list intermed, ahead of the corresponding x (where x is the i-th item in seq). This placement ensures that two items of seq will never be compared directly, even if they have the same value for the attribute named attr. Even in that case, their indices will still differ, and thus Python’s lexicographic comparison of the tuples will never get all the way to comparing the tuples’ last items (the original items from seq). Avoiding object comparisons may save us from performing extremely slow operations, or even from attempting forbidden ones. For example, we could sort a list of complex numbers by their real attribute: we would get an exception if we ever tried to compare two complex numbers directly, because no ordering is defined on complex numbers. But thanks to the precaution described in this paragraph, such an event can never occur, and the sorting will therefore proceed correctly.
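For instance, here is a minimal demonstration with the recipe’s sort_by_attr (values invented for the example); comparing the complex numbers themselves would raise a TypeError, but sorting them by their real attribute works fine:

>>> nums = [3+4j, 1+2j, 3-1j, 1+0j]
>>> sort_by_attr(nums, 'real')
[(1+2j), (1+0j), (3+4j), (3-1j)]

Note that the two numbers with equal real parts keep their original relative order, thanks to the index placed in each tuple.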
As mentioned earlier in Recipe 5.2, Python 2.4 supports DSU directly. You can pass an optional keyword argument key to sort, which is the callable to use on each item to get the sort key. Standard library module operator has two new functions, attrgetter and itemgetter, that exist specifically to return callables suitable for this purpose. In Python 2.4, the ideal solution to this problem therefore becomes:
import operator
seq.sort(key=operator.attrgetter(attr))
This snippet performs the sort in-place, which helps make it blazingly fast—on my machine, three times faster than the Python 2.3 function shown first in this recipe. If you need a sorted copy, without disturbing seq, you can get it using Python 2.4’s new built-in function sorted:
sorted_copy = sorted(seq, key=operator.attrgetter(attr))
While not quite as fast as an in-place sort, this latest snippet is still over 2.5 times faster than the function shown first in this recipe. Python 2.4 also guarantees that, when you pass the optional key named argument, list items will never be accidentally compared directly, so you need not take any special safeguards. Moreover, stability is also guaranteed.
See also Recipe 5.2; Python 2.4’s Library Reference docs about the sorted built-in function, the operator module’s attrgetter and itemgetter functions, and the key argument to sort and sorted.
Credit: John Jensen, Fred Bremmer, Nick Coghlan
You need to count the occurrences of various items and present the items in order of their number of occurrences—for example, to produce a histogram.
A histogram, apart from graphical issues, is based on counting the occurrences of items (easy to do with a Python list or dictionary) and then sorting the keys or indices in an order based on corresponding values. Here is a subclass of dict that adds two methods for the purpose:
class hist(dict):
    def add(self, item, increment=1):
        ''' add 'increment' to the entry for 'item' '''
        self[item] = increment + self.get(item, 0)
    def counts(self, reverse=False):
        ''' return list of keys sorted by corresponding values '''
        aux = [ (self[k], k) for k in self ]
        aux.sort()
        if reverse:
            aux.reverse()
        return [k for v, k in aux]
If the items you’re counting are best modeled by small integers in a compact range, so that you want to keep item counts in a list, the solution is quite similar:
class hist1(list):
    def __init__(self, n):
        ''' initialize this list to count occurrences of n distinct items '''
        list.__init__(self, n*[0])
    def add(self, item, increment=1):
        ''' add 'increment' to the entry for 'item' '''
        self[item] += increment
    def counts(self, reverse=False):
        ''' return list of indices sorted by corresponding values '''
        aux = [ (v, k) for k, v in enumerate(self) ]
        aux.sort()
        if reverse:
            aux.reverse()
        return [k for v, k in aux]
The add method of hist embodies the normal Python idiom for counting occurrences of arbitrary (but hashable) items, using a dict to hold the counts. In class hist1, based on a list, we take a different approach, initializing all counts to 0 in __init__, so the add method is even simpler.
The counts methods produce the lists of keys, or indices, sorted in the order given by the corresponding values. The problem is very similar in both classes, hist and hist1; therefore, the solutions are also almost identical, using in each case the DSU approach already shown in Recipe 5.2 and Recipe 5.3. If we need both classes in our program, the similarity is so close that we should surely factor out the commonalities into a single auxiliary function _sorted_keys:
def _sorted_keys(container, keys, reverse):
    ''' return list of 'keys' sorted by corresponding values in 'container' '''
    aux = [ (container[k], k) for k in keys ]
    aux.sort()
    if reverse:
        aux.reverse()
    return [k for v, k in aux]
and then implement the counts methods of each class as thin wrappers over this _sorted_keys function:
class hist(dict):
    ...
    def counts(self, reverse=False):
        return _sorted_keys(self, self, reverse)

class hist1(list):
    ...
    def counts(self, reverse=False):
        return _sorted_keys(self, xrange(len(self)), reverse)
DSU is so important that in Python 2.4, as shown previously in Recipe 5.2 and Recipe 5.3, the sort method of lists and the new built-in function sorted offer a fast, intrinsic implementation of it. Therefore, in Python 2.4, function _sorted_keys can become much simpler and faster:
def _sorted_keys(container, keys, reverse):
    return sorted(keys, key=container.__getitem__, reverse=reverse)
The bound-method container.__getitem__ performs exactly the same operation as the indexing container[k] in the Python 2.3 implementation, but it’s a callable to call on each k of the sequence that we’re sorting, namely keys—exactly the right kind of value to pass as the key keyword argument to the sorted built-in function. Python 2.4 also affords an easy, direct way to get a list of a dictionary’s items sorted by value:
from operator import itemgetter

def dict_items_sorted_by_value(d, reverse=False):
    return sorted(d.iteritems(), key=itemgetter(1), reverse=reverse)
The operator.itemgetter higher-order function, also new in Python 2.4, is a handy way to supply the key argument when you want to sort a container whose items are subcontainers, keying on a certain item of each subcontainer. This is exactly the case here, since a dictionary’s items are a sequence of pairs (two-item tuples), and we want to sort the sequence keying on the second item of each tuple.
Getting back to this recipe’s main theme, here is a usage example for the class hist shown in this recipe’s Solution:
sentence = ''' Hello there this is a test.  Hello there this was a test,
    but now it is not. '''
words = sentence.split()
c = hist()
for word in words:
    c.add(word)
print "Ascending count:"
print c.counts()
print "Descending count:"
print c.counts(reverse=True)
This code snippet produces the following output:
Ascending count:
['but', 'it', 'not.', 'now', 'test,', 'test.', 'was', 'Hello', 'a', 'is', 'there', 'this']
Descending count:
['this', 'there', 'is', 'a', 'Hello', 'was', 'test.', 'test,', 'now', 'not.', 'it', 'but']
See also “Special Method Names” in the Language Reference and the OOP chapter in Python in a Nutshell, about special method __getitem__; the Library Reference docs for the Python 2.4 sorted built-in and the key= argument to sort and sorted.
Credit: Sébastien Keim, Chui Tey, Alex Martelli
You need to sort a list of strings that contain substrings of digits (e.g., a list of postal addresses) in an order that looks good. For example, 'foo2.txt' should come before 'foo10.txt'. However, Python’s default string comparison is alphabetical, so, by default, 'foo10.txt' instead comes before 'foo2.txt'.
You need to split each string into sequences of digits and nondigits, and transform each sequence of digits into a number. This gives you a list that is just the right comparison key for the sort you want, and you can then use DSU for the sort itself—that is, code two functions, shorter than this description:
import re
re_digits = re.compile(r'(\d+)')

def embedded_numbers(s):
    pieces = re_digits.split(s)              # split into digits/nondigits
    pieces[1::2] = map(int, pieces[1::2])    # turn digits into numbers
    return pieces

def sort_strings_with_embedded_numbers(alist):
    aux = [ (embedded_numbers(s), s) for s in alist ]
    aux.sort()
    return [ s for __, s in aux ]            # convention: __ means "ignore"
In Python 2.4, use the native support for DSU, with the same function embedded_numbers to get the sort key:
def sort_strings_with_embedded_numbers(alist):
    return sorted(alist, key=embedded_numbers)
Say you have an unsorted list of filenames, such as:
files = 'file3.txt file11.txt file7.txt file4.txt file15.txt'.split( )
If you just sort and print this list, for example in Python 2.4 with print ' '.join(sorted(files)), your output looks like file11.txt file15.txt file3.txt file4.txt file7.txt, since, by default, strings are sorted alphabetically (to use a fancier word, the sort order is described as lexicographical). Python cannot just guess that you mean to treat in a different way those substrings that happen to be made of digits; you have to tell Python precisely what you want, and this recipe shows how.
Using this recipe, you can get a nicer-looking result:
print ' '.join(sort_strings_with_embedded_numbers(files))
The output is now file3.txt file4.txt file7.txt file11.txt file15.txt, which is probably just what you want in this case.
The implementation relies on the DSU idiom. We need to code DSU explicitly if we want to support Python 2.3, while if our code is Python 2.4-only, we just rely on the native implementation of DSU. We do so by passing an argument named key (a function to be called on each item to get the right comparison key for the sort) to the new built-in function sorted.
Function embedded_numbers in the recipe is how we get the right comparison key for each item: a list alternating substrings of nondigits, and the int obtained from each substring of digits. re_digits.split(s) gives us a list of alternating substrings of nondigits and digits (with the substrings of digits at odd-numbered indices); then, we use built-in functions map and int (and extended-form slices that get and set all items at odd-numbered indices) to turn sequences of digits into integers. Lexicographical comparison on this list of mixed types now produces just the right result.
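To make the mechanism concrete, here is what the comparison key looks like for a couple of the sample filenames (a quick interactive sketch):

>>> embedded_numbers('file3.txt')
['file', 3, '.txt']
>>> embedded_numbers('file11.txt')
['file', 11, '.txt']

Comparing these two lists puts 3 before 11, exactly the numeric ordering we want.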
See also the Library Reference and Python in a Nutshell docs about extended slicing and about module re; the Python 2.4 Library Reference about the sorted built-in function and the key argument to sort and sorted; Recipe 5.3; Recipe 5.2.
Credit: Iuri Wickert, Duncan Grisby, T. Warner, Steve Holden, Alex Martelli
You need to process, in random order, all of the items of a long list. As usual in Python, the best approach is the simplest one. If we are allowed to change the order of items in the input list, then the following function is simplest and fastest:
import random

def process_all_in_random_order(data, process):
    # first, put the whole list into random order
    random.shuffle(data)
    # next, just walk over the list linearly
    for elem in data:
        process(elem)
If we must preserve the input list intact, or if the input data may be some iterable that is not a list, just insert as the first statement of the function the assignment data = list(data).
While it’s a common mistake to be overly concerned with speed, don’t make the opposite mistake of ignoring the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition (assume that we’re allowed to mangle or destroy the input list). The first idea to suggest itself might be to repeatedly pick an item at random (with function random.choice), removing each picked item from the list to avoid future repetitions:
import random

def process_random_removing(data, process):
    while data:
        elem = random.choice(data)
        data.remove(elem)
        process(elem)
However, this function is painfully slow, even for input lists of just a few hundred elements. Each call to data.remove must linearly search through the list to find the element to delete. Since the cost of each of n steps is O(n), the whole process is O(n²)—time proportional to the square of the length of the list (and with a large multiplicative constant, too).
Minor improvements to this first idea could focus on obtaining random indices, using the pop method of the list to get and remove an item at the same time, low-level fiddling with indices to avoid the costly removal in favor of swapping the picked item with the last yet-unpicked one towards the end, or using dictionaries or sets instead of lists. This latest idea might be based on a hope of using a dict’s popitem method (or the equivalent method pop of class sets.Set and Python 2.4’s built-in type set), which may look like it’s designed exactly to pick and remove a random item, but, beware! dict.popitem is documented to return and remove an arbitrary item of the dictionary, and that’s a far cry from a random item. Check it out:
>>> d = dict(enumerate('ciao'))
>>> while d: print d.popitem()
It may surprise you, but in most Python implementations this snippet will print d’s items in a far from random order, typically (0, 'c') then (1, 'i') and so forth. In short, if you need pseudo-random behavior in Python, you need standard library module random—popitem is not an alternative.
If you thought about using a dictionary rather than a list, you are definitely on your way to “thinking Pythonically”, even though it turns out that dictionaries wouldn’t provide a substantial performance boost for this specific problem. However, an approach that is even more Pythonic than choosing the right data structure is best summarized as: let the standard library do it! The Python Standard Library is large, rich, and chock full of useful, robust, fast functions and classes for a wide variety of tasks. In this case, the key intuition is realizing that, to walk over a sequence in a random order, the simplest approach is to first put that sequence into random order (known as shuffling the sequence, an analogy with shuffling a deck of cards) and then walk over the shuffled sequence linearly. Function random.shuffle performs the shuffling, and the function shown in this recipe’s Solution just uses it.
Performance should always be measured, never guessed at, and that’s what standard library module timeit is for. Using a null process function and a list of length 1,000 as data, process_all_in_random_order is almost 10 times faster than process_random_removing; with a list of length 2,000, the performance ratio grows to almost 20. While an improvement of, say, 25%, or even a constant factor of 2, usually can be neglected without really affecting the performance of your program as a whole, the same does not apply to an algorithm that is 10 or 20 times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself. Moreover, this risk increases when we’re talking about O(n²) versus O(n) behavior: with such differences in big-O behavior, the performance ratio between bad and good algorithms keeps increasing without bounds as the size of the input data grows.
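As a sketch of how such a measurement might be run with timeit (assuming, purely for the example, that the two functions above are saved in a module named shuffle_demo.py; exact timings will of course vary):

import timeit

setup = '''
from shuffle_demo import (process_all_in_random_order,
                          process_random_removing)
def process(x): pass    # null "process" function
data = range(1000)
'''
for func in ('process_all_in_random_order', 'process_random_removing'):
    t = timeit.Timer('%s(list(data), process)' % func, setup)
    # repeat each measurement a few times and report the best
    print func, min(t.repeat(3, 10))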
See also the documentation for the random and timeit modules in the Library Reference and Python in a Nutshell.
Credit: John Nielsen
You want to maintain a sequence, to which items are added, in a sorted state, so that at any time, you can easily examine or remove the smallest item currently present in the sequence.
Say you start with an unordered list, such as:
the_list = [903, 10, 35, 69, 933, 485, 519, 379, 102, 402, 883, 1]
You could call the_list.sort() to make the list sorted and then result = the_list.pop(0) to get and remove the smallest item. But then, every time you add an item (say with the_list.append(0)), you need to call the_list.sort() again to keep the list sorted.
Alternatively, you can use the heapq module of the Python Standard Library:
import heapq
heapq.heapify(the_list)
Now the list is not necessarily fully sorted, but it does satisfy the heap property (meaning if all indices involved are valid, the_list[i] <= the_list[2*i+1] and the_list[i] <= the_list[2*i+2])—so, in particular, the_list[0] is the smallest item. To keep the heap property valid, use result = heapq.heappop(the_list) to get and remove the smallest item and heapq.heappush(the_list, newitem) to add a new item. When you need to do both—add a new item while getting and removing the previously smallest item—you can use result = heapq.heapreplace(the_list, newitem).
When you need to retrieve data in an ordered way (at each retrieval getting the smallest item among those you currently have at hand), you can pay the runtime cost for the sorting when you retrieve the data, or you can pay for it when you add the data. One approach is to collect your data into a list and sort the list. Now it’s easy to get your data in order, smallest to largest. However, you have to keep calling sort each time you add new data during the retrieval, to make sure you can later keep retrieving from the smallest current item after each addition. The method sort of Python lists is implemented with a little-known algorithm called natural mergesort, which minimizes the runtime cost of this approach. Yet the approach can still be burdensome: each addition (and sorting) and each retrieval (and removal, via pop) takes time proportional to the number of current items in the list (O(N), in common parlance).
An alternative approach is to use a data organization known as a heap, a type of binary tree implemented compactly, yet ensuring that each “parent” is always less than its “children”. The best way to maintain a heap in Python is to use a list and have it managed by the heapq library module, as shown in this recipe’s Solution. The list does not get fully sorted, yet you can be sure that, whenever you heappop an item from the list, you always get the lowest item currently present, and all others will be adjusted to ensure the heap property is still valid. Each addition with heappush, and each removal with heappop, takes a short time proportional to the logarithm of the current length of the list (O(log N), in common parlance). You pay as you go, a little at a time (and not too much in total, either).
A good occasion to use this heap approach, for example, is when you have a long-running queue with new data periodically arriving, and you always want to be able to get the most important item off the queue without having to constantly re-sort your data or perform full searches. This concept is known as a priority queue, and a heap is an excellent way to implement it. Note that, intrinsically, the heapq module supplies you with the smallest item at each heappop, so make sure to arrange the way you encode your items’ priority values to reflect this. For example, say that you receive incoming items each accompanied by a cost, and the most important item at any time is the one with the highest cost that is currently on the queue; moreover, among items of equal cost, the most important one is the one that arrived earliest. Here’s a way to build a “priority queue” class respecting these specs and based on functions of module heapq:
import heapq

class prioq(object):
    def __init__(self):
        self.q = [ ]
        self.i = 0
    def push(self, item, cost):
        heapq.heappush(self.q, (-cost, self.i, item))
        self.i += 1
    def pop(self):
        return heapq.heappop(self.q)[-1]
The main idea in this snippet is to push on the heap tuples whose first item is the cost with changed sign, so that higher costs result in smaller tuples (by Python’s natural comparison); right after the cost, we put a progressive index, so that, among items with equal cost, the one arriving earliest will be in a smaller tuple.
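For instance (job names and costs invented for the example):

>>> pq = prioq()
>>> pq.push('backup', 1)
>>> pq.push('render', 5)
>>> pq.push('cleanup', 1)
>>> pq.pop()    # highest cost comes out first
'render'
>>> pq.pop()    # ties are broken by arrival order
'backup'
>>> pq.pop()
'cleanup'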
In Python 2.4, module heapq has been reimplemented and optimized; see Recipe 5.8 for more information about heapq.
See also the docs for module heapq in the Library Reference and Python in a Nutshell; heapq.py in the Python sources contains a very interesting discussion of heaps; Recipe 5.8 for more information about heapq; Recipe 19.14 for merging sorted sequences using heapq.
Credit: Matteo Dell’Amico, Raymond Hettinger, George Yoshida, Daniel Harding
You need to get just a few of the smallest items from a sequence. You could sort the sequence and just use seq[:n], but is there any way you can do better?
Perhaps you can do better, if n, the number of items you need, is small compared to N, the sequence’s length. sort is very fast, but it still takes O(N log N) time, while we can get the first n smallest elements in time O(N) if n is small. Here is a simple and practical generator for this purpose, which works equally well in Python 2.3 and 2.4:
import heapq

def isorted(data):
    data = list(data)
    heapq.heapify(data)
    while data:
        yield heapq.heappop(data)
In Python 2.4 only, you can use an even simpler and faster way to get the smallest n items of data when you know n in advance:
import heapq

def smallest(n, data):
    return heapq.nsmallest(n, data)
data can be any bounded iterable; the recipe’s function isorted starts by calling list on it to ensure that. You can remove the statement data = list(data) if all the following conditions hold: you know that data is a list to start with, you don’t mind the fact that the generator reorders data’s items, and you want to remove items from data as you fetch them.
As shown previously in Recipe 5.7, the Python Standard Library contains module heapq, which supports the data structures known as heaps. Generator isorted in this recipe’s Solution relies on making a heap at the start (via heapq.heapify) and then yielding and removing the heap’s smallest remaining item at each step (via heapq.heappop).
In Python 2.4, module heapq has also grown two new functions. heapq.nlargest(n, data) returns a list of the n largest items of data; heapq.nsmallest(n, data) returns a list of the n smallest items. These functions do not require that data satisfy the heap condition; indeed, they do not even require data to be a list—any bounded iterable whose items are comparable will do. Function smallest in this recipe’s Solution just lets heapq.nsmallest do all the work.
To judge speed, we must always measure it—guessing about relative speeds of different pieces of code is a mug’s game. So, how does isorted’s performance compare with Python 2.4’s built-in function sorted, when we’re only looping on the first few (smallest) items? To help measure timing, I wrote a top10 function that can use either approach, and I also made sure I had a sorted function even in Python 2.3, where it’s not built in:
try:
    sorted
except NameError:
    def sorted(data):
        data = list(data)
        data.sort()
        return data

import itertools

def top10(data, howtosort):
    return list(itertools.islice(howtosort(data), 10))
On my machine running Python 2.4 on thoroughly shuffled lists of 1,000 integers, top10 takes about 260 microseconds with isorted, while it takes about 850 microseconds with the built-in sorted. However, Python 2.3 is much slower and gives vastly different results: about 12 milliseconds with isorted, about 2.7 milliseconds with sorted. In other words, Python 2.3 is 3 times slower than Python 2.4 for sorted, but it’s 50 times slower for isorted. Lesson to retain: whenever you optimize, measure. You shouldn’t choose optimizations based on first principles, since the performance numbers can vary so widely, even between highly compatible “point releases”. A secondary point can be made: if you care about performance, move to Python 2.4 as soon as you can. Python 2.4 has been vastly optimized and accelerated over Python 2.3, particularly in areas related to searching and sorting.
If you know that your code need only support Python 2.4, then, as this recipe’s Solution indicates, using heapq’s new function nsmallest is faster, as well as simpler, than doing your own coding. To implement top10 in Python 2.4, for example, you just need:
import heapq

def top10(data):
    return heapq.nsmallest(10, data)
This version takes about half the time of the previously shown isorted-based top10, when called on the same thoroughly shuffled lists of 1,000 integers.
See also the Library Reference and Python in a Nutshell docs about method sort of type list, and about modules heapq and timeit; Chapter 19 for more about iteration in Python; Python in a Nutshell’s chapter on optimization; heapq.py in the Python sources contains a very interesting discussion of heaps; Recipe 5.7 for more information about heapq.
Credit: Noah Spurrier
You need to look for a lot of items in a sequence.
If list L is sorted, module bisect from the Python Standard Library makes it easy to check if some item x is present in L:
import bisect
x_insert_point = bisect.bisect_right(L, x)
x_is_present = L[x_insert_point-1:x_insert_point] == [x]
Looking for an item x in a list L is very easy in Python: to check whether the item is there at all, if x in L; to find out where exactly it is, L.index(x). However, if L has length n, these operations take time proportional to n—essentially, they just loop over the list’s items, checking each for equality to x. If L is sorted, we can do better.
The classic algorithm to look for an item in a sorted sequence is known as binary search, because at each step it roughly halves the range it’s still searching on—it generally takes about log2(n) steps. It’s worth considering when you’re going to look for items many times, so you can amortize the cost of sorting over many searches. Once you’ve decided to use binary search for x in L, after calling L.sort(), module bisect from the Python Standard Library makes the job easy.
Specifically, we need function bisect.bisect_right, which returns the index where an item should be inserted, to keep the sorted list sorted, but doesn’t alter the list; moreover, if the item already appears in the list, bisect_right returns an index that’s just to the right of any items with the same value. So, after getting this “insert point” by calling bisect.bisect_right(L, x), we need only to check the list immediately before the insert point, to see if an item equal to x is already there.
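A tiny worked example (list contents invented for the illustration):

>>> import bisect
>>> L = [2, 5, 5, 8, 13]
>>> bisect.bisect_right(L, 5)    # insert point just right of the 5s
3
>>> L[2:3] == [5]                # the item just before the insert point
True
>>> bisect.bisect_right(L, 7)    # 7 is absent: insert point is also 3...
3
>>> L[2:3] == [7]                # ...but the item just before it is 5
False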
The way we compute x_is_present in the “Solution” may not be immediately obvious. If we know that L is not empty, we can use a simpler and more obvious approach:
x_is_present = L[x_insert_point-1] == x
However, the indexing in this simpler approach raises an exception when L is empty. When the slice boundaries are invalid, slicing is less “strict” than indexing, since it just produces an empty slice without raising any exception. In general, somelist[i:i+1] is the same one-item list as [somelist[i]] when i is a valid index in somelist; it’s an empty list [ ] when the indexing would raise IndexError. The computation of x_is_present in the recipe exploits this important property to avoid having to deal with exceptions and handle empty and nonempty cases for L in one uniform way. An alternative approach is:
x_is_present = L and L[x_insert_point-1] == x
This alternative approach exploits and’s short-circuiting behavior to guard the indexing, instead of using slicing.
An auxiliary dict, as shown in Recipe 5.12, is also a possibility as long as items are hashable (meaning that items can be used as keys into a dict). However, the approach in this recipe, based on a sorted list, may be the only useful one when the items are comparable (otherwise, the list could not be sorted) but not hashable (so a dict can’t have those items as its keys).
When the list is already sorted, and the number of items you need to look up in it is not extremely large, it may in any case be faster to use bisect than to build an auxiliary dictionary, since the investment of time in the latter operation might not be fully amortized. This is particularly likely in Python 2.4, since bisect has been optimized very effectively and is much faster than it was in Python 2.3. On my machine, for example, bisect.bisect_right for an item in the middle of a list of 10,000 integers is about four times faster in Python 2.4 than it was in Python 2.3.
See also the documentation for the bisect module in the Library Reference and Python in a Nutshell; Recipe 5.12.
Credit: Raymond Hettinger, David Eppstein, Shane Holloway, Chris Perkins
You need to get from a sequence the nth item in rank order (e.g., the middle item, known as the median). If the sequence was sorted, you would just use seq[n]. But the sequence isn’t sorted, and you wonder if you can do better than just sorting it first.
Perhaps you can do better, if the sequence is big, has been shuffled enough, and comparisons between its items are costly. Sort is very fast, but in the end (when applied to a thoroughly shuffled sequence of length n) it always takes O(n log n) time, while there exist algorithms that can be used to get the nth smallest element in time O(n). Here is a function with a solid implementation of such an algorithm:
import random

def select(data, n):
    " Find the nth rank ordered element (the least value has rank 0). "
    # make a new list, deal with <0 indices, check for valid index
    data = list(data)
    if n < 0:
        n += len(data)
    if not 0 <= n < len(data):
        raise ValueError, "can't get rank %d out of %d" % (n, len(data))
    # main loop, quicksort-like but with no need for recursion
    while True:
        pivot = random.choice(data)
        pcount = 0
        under, over = [ ], [ ]
        uappend, oappend = under.append, over.append
        for elem in data:
            if elem < pivot:
                uappend(elem)
            elif elem > pivot:
                oappend(elem)
            else:
                pcount += 1
        numunder = len(under)
        if n < numunder:
            data = under
        elif n < numunder + pcount:
            return pivot
        else:
            data = over
            n -= numunder + pcount
This recipe is meant for cases in which repetitions count. For example, the median of the list [1, 1, 1, 2, 3] is 1 because that is the third one of the five items in rank order. If, for some strange reason, you want to discount duplications, you need to reduce the list to its unique items first (e.g., by applying Recipe 18.1), after which you may want to come back to this recipe.
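For instance, reusing that little list, here is a minimal sketch of select at work:

>>> data = [1, 1, 1, 2, 3]
>>> select(data, len(data) // 2)    # rank 2 out of 5: the median
1
>>> select(data, -1)                # negative ranks count from the top
3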
Input argument data can be any bounded iterable; the recipe starts by calling list on it to ensure that. The algorithm then loops, implementing at each leg a few key ideas: randomly choosing a pivot element; slicing up the list into two parts, made up of the items that are “under” and “over” the pivot respectively; continuing work for the next leg on just one of the two parts, since we can tell which one of them the nth element will be in, and the other part can safely be ignored. The ideas are very close to those in the classic algorithm known as quicksort (except that quicksort cannot ignore either part, and thus must use recursion, or recursion-removal techniques such as keeping an explicit stack, to make sure it deals with both parts).
The random choice of pivot makes the algorithm robust against unfavorable data orderings (the kind that wreak havoc with naive quicksort); this implementation decision costs about log2(N) calls to random.choice. Another implementation issue worth pointing out is that the recipe counts the number of occurrences of the pivot: this precaution ensures good performance even in the anomalous case where data contains a high number of repetitions of identical values.
Extracting the bound methods .append of lists under and over as local variables uappend and oappend may look like a pointless, if tiny, complication, but it is, in fact, a very important optimization technique in Python. To keep the compiler simple, straightforward, unsurprising, and robust, Python does not hoist constant computations out of loops, nor does it “cache” the results of method lookup. If you call under.append and over.append in the inner loop, you pay the cost of lookup each and every time. If you want something hoisted, hoist it yourself. When you’re considering an optimization, you should always measure the code’s performance with and without that optimization, to check that the optimization does indeed make an important difference. According to my measurements, removing this single optimization slows performance down by about 50% for the typical task of picking the 5000th item of range(10000). Considering the tiny amount of complication involved, a difference of 50% is well worth it.
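Outside this recipe, the general hoisting pattern looks like this schematic sketch (data and f here stand for any sequence and any per-item computation):

result = [ ]
rappend = result.append    # look the bound method up just once
for elem in data:
    rappend(f(elem))       # no attribute lookup inside the loop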
A natural idea for optimization, which just didn’t make the grade once carefully measured, is to call cmp(elem, pivot) in the loop body, rather than making separate tests for elem < pivot and elem > pivot. Unfortunately, measurement shows that cmp doesn’t speed things up; in fact, it slows them down, at least when the items of data are of elementary types such as numbers and strings.
So, how does select’s performance compare with the simpler alternative of:
def selsor(data, n):
    data = list(data)
    data.sort()
    return data[n]
On thoroughly shuffled lists of 3,001 integers on my machine, this recipe’s select takes about 16 milliseconds to find the median, while selsor takes about 13 milliseconds; considering that sort could take advantage of any partial sortedness in the data, for this kind of length, and on elementary data whose comparisons are fast, it’s not to your advantage to use select. For a length of 30,001, performance becomes very close between the two approaches—around 170 milliseconds either way. When you push the length all the way to 300,001, select provides an advantage, finding the median in about 2.2 seconds, while selsor takes about 2.5.
The break-even point will be smaller if the items in the sequence have costly comparison methods, since the key difference between the two approaches is in the number of comparisons performed—select takes O(n), selsor takes O(n log n). For example, say we need to compare instances of a class designed for somewhat costly comparisons (simulating four-dimensional points that will often coincide on the first few dimensions):
import random

class X(object):
    def __init__(self):
        self.a = self.b = self.c = 23.51
        self.d = random.random()
    def _dats(self):
        return self.a, self.b, self.c, self.d
    def __cmp__(self, oth):
        return cmp(self._dats(), oth._dats())
Here, select already becomes faster than selsor when what we’re computing is the median of vectors of 201 such instances.
In other words, although select has more general overhead, when compared to the wondrously efficient coding of lists’ sort method, nevertheless, if n is large enough and each comparison is costly enough, select is still well worth considering.
See also the Library Reference and Python in a Nutshell docs about method sort of type list, and about module random.
Credit: Nathaniel Gray, Raymond Hettinger, Christophe Delord, Jeremy Zucker
You need to show that Python’s support for the functional programming paradigm is better than it might seem at first sight.
Functional programming languages, of which Haskell is a great example, are splendid animals, but Python can hold its own in such company:
def qsort(L):
    if len(L) <= 1: return L
    return qsort([lt for lt in L[1:] if lt < L[0]]) + L[0:1] + qsort([ge for ge in L[1:] if ge >= L[0]])
In my humble opinion, this code is almost as pretty as the Haskell version from http://www.haskell.org:
qsort []     = []
qsort (x:xs) = qsort elts_lt_x ++ [x] ++ qsort elts_greq_x
    where
        elts_lt_x   = [y | y <- xs, y < x]
        elts_greq_x = [y | y <- xs, y >= x]
Here’s a test function for the Python version:
def qs_test(length):
    import random
    joe = range(length)
    random.shuffle(joe)
    qsJoe = qsort(joe)
    for i in range(len(qsJoe)):
        assert qsJoe[i] == i, 'qsort is broken at %d!' % i
This rather naive implementation of quicksort illustrates the expressive power of list comprehensions. Do not use this approach in real code! Python lists have an in-place sort method that is much faster and should always be preferred; in Python 2.4, the new built-in function sorted accepts any finite sequence and returns a new sorted list with the sequence’s items. The only proper use of this recipe is for impressing friends, particularly ones who (quite understandably) are enthusiastic about functional programming, and particularly about the Haskell language.
I cooked up this function after finding the wonderful Haskell quicksort (which I’ve reproduced in the “Solution”) at http://www.haskell.org/aboutHaskell.html. After marveling at the elegance of this code for a while, I realized that list comprehensions made the same approach possible in Python. Not for nothing did we steal list comprehensions right out of Haskell, just Pythonizing them a bit by using keywords rather than punctuation!
Both implementations pivot on the first element of the list and
thus have worst-case O(n²)
performance for the very common case of sorting an already sorted
list. You would never want to do so in production code! Because this
recipe is just a propaganda piece, though, it doesn’t really
matter.
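To see the worst case bite in practice, note that, with the first element as pivot, each recursive level strips off just one element of an already-sorted list, so the recursion depth grows linearly with the list's length; a tiny sketch (assuming the qsort shown in the "Solution"):
import sys
print sys.getrecursionlimit()    # typically 1000
try:
    qsort(range(5000))           # already sorted: recursion depth ~5000
except RuntimeError, e:
    print e                      # maximum recursion depth exceeded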
You can write a less compact version with similar architecture in order to use named local variables and functions for enhanced clarity:
def qsort(L):
    if not L: return L
    pivot = L[0]
    def lt(x): return x < pivot
    def ge(x): return x >= pivot
    return qsort(filter(lt, L[1:])) + [pivot] + qsort(filter(ge, L[1:]))
Once you start going this route, you can easily move to a slightly less naive version, using random pivot selection to make worst-case performance less likely, and counting occurrences of the pivot to handle degenerate cases with many equal elements:
import random

def qsort(L):
    if not L: return L
    pivot = random.choice(L)
    def lt(x): return x < pivot
    def gt(x): return x > pivot
    return qsort(filter(lt, L)) + [pivot]*L.count(pivot) + qsort(filter(gt, L))
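As a quick check (not part of the recipe) that the pivot-counting version deals correctly with duplicates:
print qsort([3, 1, 3, 2, 3])     # emits [1, 2, 3, 3, 3]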
Despite these enhancements, such versions are meant essentially for fun and demonstration purposes. Production-quality sorting code is quite another thing: these little jewels, no matter how much we dwell on them, will never match the performance and solidity of Python’s own built-in sorting approaches.
Rather than going for clarity and robustness, we can move in the
opposite direction to make this last point most obvious, showing off
the obscurity and compactness that one can get with Python’s lambda
:
q=lambda x:(lambda o=lambda s:[i for i in x if cmp(i,x[0])==s]:
            len(x)>1 and q(o(-1))+o(0)+q(o(1)) or x)()
At least, with this beauty (a single
logical line, although it needs to be split into two physical lines
due to its length), it should be absolutely obvious that this approach
is not meant for real-world use. The equivalent, using more readable
def
statements rather than opaque
lambda
s, would still be pretty
obscure:
def q(x):
    def o(s): return [i for i in x if cmp(i,x[0])==s]
    return len(x)>1 and q(o(-1))+o(0)+q(o(1)) or x
but a little more clarity (and sanity) can be recovered by
opening up the pithy len(x)>1 and . . . or
x
into an if
/else
statement and introducing sensible
local names again:
def q(x):
    if len(x)>1:
        lt = [i for i in x if cmp(i,x[0]) == -1]
        eq = [i for i in x if cmp(i,x[0]) ==  0]
        gt = [i for i in x if cmp(i,x[0]) ==  1]
        return q(lt) + eq + q(gt)
    else:
        return x
Fortunately, in the real world, Pythonistas are much too
sensible to write convoluted, lambda
-filled horrors such as this. In fact,
many (though admittedly not all) of us feel enough aversion to
lambda
itself (partly from having
seen it abused this way) that we go out of our way to use readable
def
statements instead. As a
result, the ability to decode such “bursts of line noise” is
not a necessary survival skill in the Python
world, as it might be for other languages. Any
language feature can be abused by programmers trying to be “clever” .
. . as a result, some Pythonistas (though a minority) feel a similar
aversion to features such as list comprehensions (since it’s possible
to cram too many things into a list comprehension, where a plain
for
loop would be clearer) or to
the short-circuiting behavior of operators and
/or
(since they can be abused to write obscure, terse expressions where a
plain if
statement would be
clearer).
The Haskell web site, http://www.haskell.org.
Credit: Alex Martelli
You need to perform frequent tests for membership in a
sequence. The O(n)
behavior of
repeated in
operators hurts
performance, but you can’t switch to using just a dictionary or set
instead of the sequence, because you also need to
keep the sequence’s order.
Say you need to append items to a list only if they’re not already in the list. One sound approach to this task is the following function:
def addUnique(baseList, otherList):
    auxDict = dict.fromkeys(baseList)
    for item in otherList:
        if item not in auxDict:
            baseList.append(item)
            auxDict[item] = None
If your code has to run only under Python 2.4, you can get exactly the same effect with an auxiliary set rather than an auxiliary dictionary.
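For instance, here is a sketch of how the function might look with an auxiliary set under Python 2.4 (keeping the set in sync with the list, just as the dictionary is kept in sync above):
def addUnique(baseList, otherList):
    auxSet = set(baseList)
    for item in otherList:
        if item not in auxSet:
            baseList.append(item)
            auxSet.add(item)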
A simple (naive?) approach to this recipe’s task looks good:
def addUnique_simple(baseList, otherList):
    for item in otherList:
        if item not in baseList:
            baseList.append(item)
and it may be sort of OK, if the lists are very small.
However, the simple approach can be quite slow if the lists are
not small. When you check if item not in
baseList
, Python can implement the in
operator in only one way: an internal
loop over the elements of baseList
, ending with a
result of True
as soon as an
element compares equal to item
, with a result of
False
if the loop terminates
without having found any equality. On average, executing the in
-operator takes time proportional to
len(baseList)
.
addUnique_simple
executes the in
-operator len(otherList)
times, so, in all, it takes
time proportional to the product of the lengths
of the two lists.
In the addUnique
function shown in the
“Solution”, we first build the auxiliary dictionary
auxDict
, a step that takes time proportional to
len(baseList)
. Then, the in
-operator inside the loop checks for
membership in a dict
—a step that
makes all the difference because checking for membership in a dict
takes roughly constant time,
independent of the number of items in the dict
! So, the for
loop takes time proportional to len(otherList)
, and the entire function
takes time proportional to the sum of the lengths
of the two lists.
The analysis of the running times should in fact go quite a bit
deeper, because the length of baseList
is not
constant in addUnique_simple; baseList
grows each
time an item
is processed that was not already there.
But the gist of the (surprisingly complicated) analysis is not very
different from what this simplified version indicates. We can check
this by measuring. When each list holds 10 integers, with an overlap
of 50%, the simple version is about 30% slower than the one shown in
the “Solution”, the kind of slowdown that can normally be ignored. But
with lists of 100 integers each, again with 50% overlap, the simple
version is twelve times slower than the one shown
in the “Solution”—a level of slowdown that can never be ignored, and
it only gets worse if the lists get really substantial.
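If you want to check such numbers on your machine, a crude harness like the following sketch (not part of the recipe) will serve; vary n and overlap to taste:
import time

def compare(n=100, overlap=50, reps=1000):
    # otherList shares 'overlap' items with a fresh baseList of length n
    other = range(n - overlap, 2*n - overlap)
    for func in (addUnique, addUnique_simple):
        start = time.clock()
        for i in xrange(reps):
            func(range(n), other)    # fresh baseList on each repetition
        print func.__name__, time.clock() - start

compare()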
Sometimes, you could obtain even better overall performance for
your program by permanently placing the auxiliary dict
alongside the sequence, encapsulating
both into one object. However, in this case, you must maintain the
dict
as the sequence gets modified,
to ensure it stays in sync with the sequence’s current membership.
This maintenance task is not trivial, and it can be architected in
many different ways. Here is one such way, which does the syncing
“just in time,” rebuilding the auxiliary dict
when a membership test is required and
the dictionary is possibly out of sync with the list’s contents. Since
it costs very little, the following class optimizes the index
method, as well as membership
tests:
class list_with_aux_dict(list):
    def __init__(self, iterable=()):
        list.__init__(self, iterable)
        self._dict_ok = False
    def _rebuild_dict(self):
        self._dict = {}
        for i, item in enumerate(self):
            if item not in self._dict:
                self._dict[item] = i
        self._dict_ok = True
    def __contains__(self, item):
        if not self._dict_ok:
            self._rebuild_dict()
        return item in self._dict
    def index(self, item):
        if not self._dict_ok:
            self._rebuild_dict()
        try: return self._dict[item]
        except KeyError: raise ValueError

def _wrapMutatorMethod(methname):
    _method = getattr(list, methname)
    def wrapper(self, *args):
        # Reset 'dictionary OK' flag, then delegate to the real mutator method
        self._dict_ok = False
        return _method(self, *args)
    # in Python 2.4, only: wrapper.__name__ = _method.__name__
    setattr(list_with_aux_dict, methname, wrapper)

for meth in 'setitem delitem setslice delslice iadd'.split():
    _wrapMutatorMethod('__%s__' % meth)
for meth in 'append insert pop remove extend'.split():
    _wrapMutatorMethod(meth)
del _wrapMutatorMethod    # remove auxiliary function, not needed any more
The list_with_aux_dict
class extends list
and delegates to it every method,
except __contains__
and index
. Every method that can modify list
membership is wrapped in a closure that resets a flag asserting that
the auxiliary dictionary is OK. Python’s in
-operator calls the __contains__
method.
list_with_aux_dict
’s __contains__
method rebuilds the auxiliary dictionary,
unless the flag is set (when the flag is set, rebuilding is
unnecessary); the index
method
works the same way.
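Here is a quick sketch of the class in action (hypothetical values, of course):
lwd = list_with_aux_dict('banana')
print 'n' in lwd        # True: this first test triggers a rebuild
print lwd.index('a')    # 1: index of first occurrence, as for plain lists
lwd.append('k')         # any mutator just resets the _dict_ok flag
print 'k' in lwd        # True: the dict is rebuilt 'just in time'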
Instead of building and installing wrapping closures for all the
mutating methods of the list into the
list_with_aux_dict
class with a helper function, as
the recipe does, we could write all the def
statements for the wrapper methods in
the body of list_with_aux_dict
. However, the code for
the class as presented has the important advantage of minimizing
boilerplate (repetitious plumbing code that is boring and voluminous,
and thus a likely home for bugs). Python’s strengths at introspection
and dynamic modification give you a choice: you can build method
wrappers, as this recipe does, in a smart and concise way; or, you can
choose to code the boilerplate anyway, if you prefer to avoid what
some would call the black magic of introspection and dynamic
modification of class objects.
The architecture of class list_with_aux_dict
caters well to a rather common pattern of use, where
sequence-modifying operations happen in bunches, followed by a period
of time in which the sequence is not modified, but several membership
tests may be performed. However, the addUnique_simple
function shown earlier would not get any performance benefit if
argument baseList
was an instance of this
recipe’s list_with_aux_dict
rather than a plain
list
: the function interleaves
membership tests and sequence modifications. Therefore, too many
rebuilds of the auxiliary dictionary for
list_with_aux_dict
would impede the function’s
performance. (Unless a typical case was for a vast majority of the
items of otherList
to be already contained
in baseList
, so that very few modifications
occurred compared to the number of membership tests.)
An important requisite for any of these membership-test
optimizations is that the values in the sequence must be hashable
(otherwise, of course, they cannot be keys in a dict
, nor items in a set
). For example, a list of tuples might be
subjected to this recipe’s treatment, but for a list of lists, the
recipe as it stands is just not applicable.
The Library Reference and Python in a Nutshell sections on sequence types and mapping types.
Credit: David Eppstein, Alexander Semenov
If the sequences are strings (plain or Unicode), Python
strings’ find
method and the
standard library’s re
module are
the best approach. Otherwise, use the Knuth-Morris-Pratt algorithm
(KMP):
def KnuthMorrisPratt(text, pattern):
    ''' Yields all starting positions of copies of subsequence 'pattern'
        in sequence 'text' -- each argument can be any iterable.
        At the time of each yield, 'text' has been read exactly up to and
        including the match with 'pattern' that is causing the yield. '''
    # ensure we can index into pattern, and also make a copy to protect
    # against changes to 'pattern' while we're suspended by `yield'
    pattern = list(pattern)
    length = len(pattern)
    # build the KMP "table of shift amounts" and name it 'shifts'
    shifts = [1] * (length + 1)
    shift = 1
    for pos, pat in enumerate(pattern):
        while shift <= pos and pat != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift
    # perform the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == length or matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == length:
            yield startPos
This recipe implements the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage that builds a table of shift amounts and takes time that’s directly proportional to the length of the pattern, each text symbol is processed in constant amortized time. Explanations and demonstrations of how KMP works can be found in all good elementary texts about algorithms. (A recommendation is provided in See Also.)
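For example, since each argument may be any iterable, you can search for a sub-list inside a list, a task that string methods cannot help with (hypothetical data, of course):
text = [0, 1, 0, 1, 1, 0, 1, 0, 1]
print list(KnuthMorrisPratt(text, [0, 1, 0]))    # emits [0, 5]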
If text
and
pattern
are both Python strings, you can
get a faster solution by suitably applying Python built-in search
methods:
def finditer(text, pattern):
    pos = -1
    while True:
        pos = text.find(pattern, pos+1)
        if pos < 0:
            break
        yield pos
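Note that this finditer, like the KnuthMorrisPratt generator, reports overlapping occurrences, since each search resumes at pos+1 rather than past the previous match; a quick sanity check:
print list(finditer('banana', 'ana'))    # emits [1, 3]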
For example, using an alphabet of length 4 ('ACGU
' . . .), finding all occurrences of a
pattern of length 8 in a text of length 100000, on my machine, takes
about 4.3 milliseconds with finditer
, but the same task takes about 540
milliseconds with KnuthMorrisPratt
(that’s with
Python 2.3; KMP is faster with Python 2.4, taking about 480
milliseconds, but that’s still over 100 times slower than finditer
). So remember: this recipe is
useful for searches on generic sequences,
including ones that you cannot keep in memory all at once, but if
you’re searching on strings, Python’s built-in searching methods
rule.
Many excellent books cover the fundamentals of algorithms; among such books, a widely admired one is Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to Algorithms, 2d ed. (MIT Press).
Credit: Dmitry Vasiliev, Alex Martelli
You want to use a dictionary to store the mapping
between some keys and a current score value for each key. You
frequently need to access the keys and scores in natural order
(meaning, in order of ascending scores) and to check on a key’s
current ranking in that order, so that using just a dict
isn’t quite enough.
We can subclass dict
and add
or override methods as needed. By using multiple inheritance, placing
base UserDict.DictMixin
before base dict
and carefully arranging our various
delegations and overrides, we can achieve a good balance between
getting good performance and avoiding the need to write “boilerplate”
code.
By enriching our class with many examples in its
docstring, we can use the standard library’s module doctest
to give us unit-testing
functionality, as well as ensuring the accuracy of all the examples we
write in the docstring:
#!/usr/bin/env python
''' An enriched dictionary that holds a mapping from keys to scores '''
from bisect import bisect_left, insort_left
import UserDict

class Ratings(UserDict.DictMixin, dict):
    """ class Ratings is mostly like a dictionary, with extra features:
    the value corresponding to each key is the 'score' for that key, and
    all keys are ranked in terms of their scores.  Values must be
    comparable; keys, as well as being hashable, must be comparable if
    any two keys may ever have the same corresponding value (i.e., may
    be "tied" on score).
    All mapping-like behavior is just as you would expect, such as:
    >>> r = Ratings({"bob": 30, "john": 30})
    >>> len(r)
    2
    >>> r.has_key("paul"), "paul" in r
    (False, False)
    >>> r["john"] = 20
    >>> r.update({"paul": 20, "tom": 10})
    >>> len(r)
    4
    >>> r.has_key("paul"), "paul" in r
    (True, True)
    >>> [r[key] for key in ["bob", "paul", "john", "tom"]]
    [30, 20, 20, 10]
    >>> r.get("nobody"), r.get("nobody", 0)
    (None, 0)

    In addition to the mapping interface, we offer rating-specific
    methods.  r.rating(key) returns the ranking of a key in the
    ratings, with a ranking of 0 meaning the lowest score (when two
    keys have equal scores, the keys themselves are compared, to
    "break the tie", and the lesser key gets a lower ranking):
    >>> [r.rating(key) for key in ["bob", "paul", "john", "tom"]]
    [3, 2, 1, 0]

    getValueByRating(ranking) and getKeyByRating(ranking) return the
    score and key, respectively, for a given ranking index:
    >>> [r.getValueByRating(rating) for rating in range(4)]
    [10, 20, 20, 30]
    >>> [r.getKeyByRating(rating) for rating in range(4)]
    ['tom', 'john', 'paul', 'bob']

    An important feature is that the keys() method returns keys in
    ascending order of ranking, and all other related methods return
    lists or iterators fully consistent with this ordering:
    >>> r.keys()
    ['tom', 'john', 'paul', 'bob']
    >>> [key for key in r]
    ['tom', 'john', 'paul', 'bob']
    >>> [key for key in r.iterkeys()]
    ['tom', 'john', 'paul', 'bob']
    >>> r.values()
    [10, 20, 20, 30]
    >>> [value for value in r.itervalues()]
    [10, 20, 20, 30]
    >>> r.items()
    [('tom', 10), ('john', 20), ('paul', 20), ('bob', 30)]
    >>> [item for item in r.iteritems()]
    [('tom', 10), ('john', 20), ('paul', 20), ('bob', 30)]

    An instance can be modified (adding, changing and deleting
    key-score correspondences), and every method of that instance
    reflects the instance's current state at all times:
    >>> r["tom"] = 100
    >>> r.items()
    [('john', 20), ('paul', 20), ('bob', 30), ('tom', 100)]
    >>> del r["paul"]
    >>> r.items()
    [('john', 20), ('bob', 30), ('tom', 100)]
    >>> r["paul"] = 25
    >>> r.items()
    [('john', 20), ('paul', 25), ('bob', 30), ('tom', 100)]
    >>> r.clear()
    >>> r.items()
    []
    """
    ''' the implementation carefully mixes inheritance and delegation
        to achieve reasonable performance while minimizing boilerplate,
        and, of course, to ensure semantic correctness as above.  All
        mappings' methods not implemented below get inherited, mostly
        from DictMixin, but, crucially!, __getitem__ from dict. '''
    def __init__(self, *args, **kwds):
        ''' This class gets instantiated just like 'dict' '''
        dict.__init__(self, *args, **kwds)
        # self._rating is the crucial auxiliary data structure: a list
        # of all (value, key) pairs, kept in "natural"ly-sorted order
        self._rating = [ (v, k) for k, v in dict.iteritems(self) ]
        self._rating.sort()
    def copy(self):
        ''' Provide an identical but independent copy '''
        return Ratings(self)
    def __setitem__(self, k, v):
        ''' besides delegating to dict, we maintain self._rating '''
        if k in self:
            del self._rating[self.rating(k)]
        dict.__setitem__(self, k, v)
        insort_left(self._rating, (v, k))
    def __delitem__(self, k):
        ''' besides delegating to dict, we maintain self._rating '''
        del self._rating[self.rating(k)]
        dict.__delitem__(self, k)
    ''' delegate some methods to dict explicitly to avoid getting
        DictMixin's slower (though correct) implementations instead '''
    __len__ = dict.__len__
    __contains__ = dict.__contains__
    has_key = __contains__
    ''' the key semantic connection between self._rating and the order
        of self.keys() -- DictMixin gives us all other methods 'for
        free', although we could implement them directly for slightly
        better performance. '''
    def __iter__(self):
        for v, k in self._rating:
            yield k
    iterkeys = __iter__
    def keys(self):
        return list(self)
    ''' the three ratings-related methods '''
    def rating(self, key):
        item = self[key], key
        i = bisect_left(self._rating, item)
        if item == self._rating[i]:
            return i
        raise LookupError, "item not found in rating"
    def getValueByRating(self, rating):
        return self._rating[rating][0]
    def getKeyByRating(self, rating):
        return self._rating[rating][1]

def _test():
    ''' we use doctest to test this module, which must be named
        rating.py, by validating all the examples in docstrings. '''
    import doctest, rating
    doctest.testmod(rating)

if __name__ == "__main__":
    _test()
In many ways, a dictionary is the natural data structure for storing a correspondence between keys (e.g., names of contestants in a competition) and the current “score” of each key (e.g., the number of points a contestant has scored so far, or the highest bid made by each contestant at an auction, etc.). If we use a dictionary for such purposes, we will probably want to access it often in natural order—the order in which the keys’ scores are ascending—and we’ll also want fast access to the rankings (ratings) implied by the current scores (e.g., the contestant currently in third place, the score of the contestant who is in second place, etc.).
To achieve these purposes, this recipe subclasses
dict
to add the needed
functionality that is completely missing from dict
(methods rating
,
getValueByRating
, getKeyByRating
),
and, more subtly and crucially, to modify method keys
and all other related methods so that
they return lists or iterators with the required order (i.e., the
order in which scores are ascending; if we have to break ties when two
keys have the same score, we implicitly compare the keys themselves).
Most of the detailed documentation is in the docstring of the class
itself—a crucial issue because by keeping the documentation and
examples there, we can use module doctest
from the Python Standard Library to
provide unit-testing functionality, as well as ensuring that our
examples are correct.
The most interesting aspect of the implementation is that it
takes good care to minimize boilerplate (meaning repetitious and
boring code, and therefore code where bugs are most likely to hide)
without seriously impairing performance. class Ratings
multiply inherits from
dict
and
DictMixin
, with the latter placed
first in the list of bases, so that all methods
come from the mixin, if it provides them, unless explicitly overridden
in the class.
Raymond Hettinger’s DictMixin
class was originally posted as a recipe to the online version of the
Python Cookbook and later
became part of Python 2.3’s standard library. DictMixin
provides all the methods of a
mapping except __init__
, copy
, and the four fundamental methods:
__getitem__, __setitem__, __delitem__, and, last but not least, keys
. If you are coding a mapping class and
want to ensure that your class supports all of the many methods that a
full mapping provides to application code, you should subclass
DictMixin
and supply at least the
fundamental methods (depending on your class’ semantics—e.g., if your
class has immutable instances, you need not supply the mutator methods
__setitem__ and __delitem__
). You may optionally implement
other methods for performance purposes, overriding the implementation
that DictMixin
provides. The whole
DictMixin
architecture can be seen
as an excellent example of the classic Template Method Design Pattern,
applied pervasively in a useful mix-in variant.
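To see the Template Method structure at its barest, here is a minimal sketch (not from this recipe) of a read-only mapping that supplies only the fundamental methods and inherits all the rest from DictMixin; the class name and its purpose are, of course, just hypothetical examples:
import UserDict

class AttrView(UserDict.DictMixin):
    ''' view an object's instance attributes as a read-only mapping '''
    def __init__(self, obj):
        self._obj = obj
    def __getitem__(self, key):
        try:
            return getattr(self._obj, key)
        except AttributeError:
            raise KeyError(key)
    def keys(self):
        return self._obj.__dict__.keys()
    # get, items, iteritems, has_key, __contains__, __len__, and so on
    # all come 'for free' from DictMixin; since instances are read-only,
    # we need not supply __setitem__ or __delitem__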
In this recipe’s class, we inherit __getitem__
from our other base (namely, the built-in type
dict
), and we also delegate
explicitly to dict
everything we
can for performance reasons. We have to code the elementary mutator
methods (__setitem__ and __delitem__
) because, in addition to
delegating to our base class dict
,
we need to maintain our auxiliary data structure
self._rating
—a list of (score, key)
pairs that we keep in sorted
order with the help of standard library module bisect
. We implement keys
ourselves (and while we’re at it, we
implement __iter__
—i.e.,
iterkeys
as well, since clearly
keys
is easiest to implement by
using __iter__
) to exploit
self._rating
and return the keys in the order we
need. Finally, we add the obvious implementations for __init__ and copy, in addition to the three ratings-specific methods that we supply.
The result is quite an interesting example of balancing
concision, clarity, and well-advised reuse of the enormous amount of
functionality that the standard Python library places at our disposal.
If you use this module in your applications, profiling may reveal that
a method that this recipe’s class inherits from DictMixin
has somewhat unsatisfactory
performance—after all, the implementations in DictMixin
are, of necessity, somewhat
generic. If this is the case, by all means add a direct implementation
of whatever further methods you need to achieve maximum performance!
For example, if your application performs a lot of looping on the
result of calling r.iteritems( )
for some instance r
of class
Ratings
, you may get slightly better performance by
adding to the body of the class the direct implementation of the
method:
def iteritems(self):
    for v, k in self._rating:
        yield k, v
Library Reference and Python
in a Nutshell documentation about class DictMixin
in module UserDict
, and about module bisect
.
Credit: Brett Cannon, Amos Newcombe
You want to write a directory for a group of people, and you want that directory to be grouped by the initials of their last names and sorted alphabetically.
Python 2.4’s new itertools.groupby
function makes this task
easy:
import itertools

def groupnames(name_iterable):
    sorted_names = sorted(name_iterable, key=_sortkeyfunc)
    name_dict = {}
    for key, group in itertools.groupby(sorted_names, _groupkeyfunc):
        name_dict[key] = tuple(group)
    return name_dict

pieces_order = { 2: (-1, 0), 3: (-1, 0, 1) }

def _sortkeyfunc(name):
    ''' name is a string with first and last names, and an optional
        middle name or initial, separated by spaces; returns a string
        in order last-first-middle, as wanted for sorting purposes. '''
    name_parts = name.split()
    return ' '.join([name_parts[n] for n in pieces_order[len(name_parts)]])

def _groupkeyfunc(name):
    ''' returns the key for grouping, i.e. the last name's initial. '''
    return name.split()[-1][0]
In this recipe, name_iterable
must be an
iterable whose items are strings containing names in the form first -
middle - last, with middle being optional and the parts separated by
whitespace. The result of calling groupnames
on such
an iterable is a dictionary whose keys are the last names’ initials,
and the corresponding values are the tuples of all names with that
last name’s initial.
Auxiliary function _sortkeyfunc
splits
a name that’s a single string, either “first last” or “first middle
last,” and reorders the parts into a list that starts with the last
name, followed by first name, plus the middle name or initial, if any,
at the end. Then, the function returns this list rejoined into a
string. The resulting string is the key we want to use for sorting,
according to the problem statement. Python 2.4’s built-in function
sorted
takes just this kind of
function (to call on each item to get the sort key) as the value of
its optional parameter named key
.
Auxiliary function _groupkeyfunc
takes
a name in the same form and returns the last name’s initial—the key on
which, again according to the problem statement, we want to
group.
This recipe’s primary function, groupnames
,
uses the two auxiliary functions and Python 2.4’s sorted
and itertools.groupby
to solve our problem,
building and returning the required dictionary.
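For instance (a hypothetical handful of names):
names = ['Tim Peters', 'Alex Martelli', 'David Ascher', 'Anna Ravenscroft']
print groupnames(names)
# emits (with dictionary order arbitrary, of course):
# {'A': ('David Ascher',), 'M': ('Alex Martelli',),
#  'P': ('Tim Peters',), 'R': ('Anna Ravenscroft',)}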
If you need to code this task in Python 2.3, you can use the
same two support functions and recode function
groupnames
itself. In 2.3, it is more convenient to
do the grouping first and the sorting separately on each group, since
no groupby
function is available in
Python 2.3’s standard library:
def groupnames(name_iterable):
    name_dict = {}
    for name in name_iterable:
        key = _groupkeyfunc(name)
        name_dict.setdefault(key, []).append(name)
    for k, v in name_dict.iteritems():
        aux = [(_sortkeyfunc(name), name) for name in v]
        aux.sort()
        name_dict[k] = tuple([ n for __, n in aux ])
    return name_dict
Recipe 19.21;
Library Reference (Python 2.4) docs on module
itertools
.