Credit: Fred L. Drake, Jr., PythonLabs
Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.
Sounds like an easy question, doesn’t it? After all, we know it when we see it, don’t we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.
Unfortunately, all data enters our applications as a sequence of bytes. There’s no library function we can call that will tell us whether a particular sequence of bytes represents text, although we can create some useful heuristics that tell us whether data can safely (not necessarily correctly) be handled as text. Recipe 1.11 shows just such a heuristic.
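As a rough illustration of such a heuristic, here is a minimal sketch written for Python 3 (where byte sequences are `bytes` objects). The function name `looks_like_text`, the set of "text" bytes, and the 30% threshold are assumptions made for this example; this is not necessarily the approach taken in Recipe 1.11.

```python
def looks_like_text(data: bytes, threshold: float = 0.30) -> bool:
    """Guess whether a byte sequence can safely be handled as text."""
    if b"\x00" in data:
        return False              # NUL bytes almost never occur in text
    if not data:
        return True               # treat an empty sequence as text
    # Printable ASCII plus a few common whitespace/control bytes (an assumption).
    text_bytes = bytes(range(32, 127)) + b"\n\r\t\b\f"
    # bytes.translate(None, delete) drops every byte listed in `delete`,
    # leaving only the bytes we do not consider text-like.
    non_text = data.translate(None, text_bytes)
    return len(non_text) / len(data) <= threshold
```

The answer is only a guess: data that passes may still be decoded with the wrong encoding, which is why the test says "safely" rather than "correctly".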
Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codecs (coder-decoders), objects that embody knowledge about the many standard ways in which sequences of characters can be represented by sequences of bytes (also known as encodings and character sets). Note that Unicode strings do not serve double duty as sequences of bytes. Recipe 1.20, Recipe 1.21, and Recipe 1.22 illustrate the fundamentals of Unicode in Python.
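The role of codecs can be sketched in a few lines. This example assumes Python 3, where the character type is called `str` and the byte type `bytes` (the text above uses the older names "Unicode string" and "plain string"); the sample word is arbitrary.

```python
text = "caf\u00e9"                      # four characters; the last is e-acute

# Encoding turns characters into bytes; the result depends on the codec.
utf8 = text.encode("utf-8")             # two bytes for the accented letter
latin1 = text.encode("iso-8859-1")      # one byte for the accented letter

# Decoding with the matching codec recovers the original characters.
round_trip = utf8.decode("utf-8")
```

The same four characters occupy five bytes in UTF-8 and four in ISO-8859-1, so the bytes alone do not say which encoding produced them.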
Okay, let’s assume that our application knows from the context that it’s looking at text. That’s usually the best approach because that’s where external input comes into play. We’re looking at a file either because it has a well-known name and defined format (common in the “Unix” world) or because it has a well-known filename extension that indicates the format of the contents (common on Windows). But now we have a problem: we had to use the word format to make the previous paragraph meaningful. Wasn’t text supposed to be simple?
Let’s face it: there’s no such thing as “pure” text, and if there were, we probably wouldn’t care about it (with the possible exception of applications in the field of computational linguistics, where pure text may indeed sometimes be studied for its own sake). What we want to deal with in our applications is information contained in text. The text we care about may contain configuration data, commands to control or define processes, documents for human consumption, or even tabular data. Text that contains configuration data or a series of commands usually can be expected to conform to a fairly strict syntax that can be checked before relying on the information in the text. Informing the user of an error in the input text is typically sufficient to deal with things that aren’t what we were expecting.
Documents intended for humans tend to be simple, but they vary widely in detail. Since they are usually written in a natural language, their syntax and grammar can be difficult to check, at best. Different texts may use different character sets or encodings, and it can be difficult or even impossible to tell which character set or encoding was used to create a text if that information is not available in addition to the text itself. It is, however, necessary to support proper representation of natural-language documents. Natural-language text has structure as well, but the structures are often less explicit in the text and require at least some understanding of the language in which the text was written. Characters make up words, which make up sentences, which make up paragraphs, and still larger structures may be present as well. Paragraphs alone can be particularly difficult to locate unless you know what typographical conventions were used for a document: is each line a paragraph, or can multiple lines make up a paragraph? If the latter, how do we tell which lines are grouped together to make a paragraph? Paragraphs may be separated by blank lines, indentation, or some other special mark. See Recipe 19.10 for an example of reading a text file as a sequence of paragraphs separated by blank lines.
Tabular data has many issues that are similar to the problems associated with natural-language text, but it adds a second dimension to the input format: the text is no longer linear—it is no longer a sequence of characters, but rather a matrix of characters from which individual blocks of text must be identified and organized.
As with any other data format, we need to do different things with text at different times. However, there are still three basic operations:
Parsing the data into a structure internal to our application
Transforming the input into something similar in some way, but with changes of some kind
Generating completely new data
Parsing can be performed in a variety of ways, and many formats can be suitably handled by ad hoc parsers that deal effectively with a very constrained format. Examples of this approach include parsers for RFC 2822-style email headers (see the rfc822 module in Python’s standard library) and the configuration files handled by the ConfigParser module. The netrc module offers another example of a parser for an application-specific file format, this one based on the shlex module. shlex offers a fairly typical tokenizer for basic languages, useful in creating readable configuration files or allowing users to enter commands to an interactive prompt. These sorts of ad hoc parsers are abundant in Python’s standard library, and recipes using them can be found in Chapter 2 and Chapter 13. More formal parsing tools are also available for Python; they depend on larger add-on packages and are surveyed in the introduction to Chapter 16.
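To give a feel for the kind of tokenizing shlex provides, here is a small sketch (Python 3 syntax) that splits a command line the way an interactive prompt might. The command string is invented for the example.

```python
import shlex

command = 'copy "my file.txt" backup/'

# shlex.split honors shell-style quoting, so the quoted filename
# stays together as a single token.
tokens = shlex.split(command)
```

A plain `command.split()` would instead break the quoted filename into two pieces and leave the quote characters in place.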
Transforming text from one format to another is more interesting when viewed as text processing, which is what we usually think of first when we talk about text. In this chapter, we’ll take a look at some ways to approach transformations that can be applied for different purposes. Sometimes we’ll work with text stored in external files, and other times we’ll simply work with it as strings in memory.
The generation of textual data from application-specific data structures is most easily performed using Python’s print statement or the write method of a file or file-like object. This is often done using a method of the application object or a function, which takes the output file as a parameter. The function can then use statements such as these:
print >>thefile, sometext
thefile.write(sometext)
which generate output to the appropriate file. However, this isn’t generally thought of as text processing, as here there is no input text to be processed. Examples of using both print and write can of course be found throughout this book.
Working with text stored as a string in memory can be easy when the text is not too large. Operations that search the text can operate over multiple lines very easily and quickly, and there’s no need to worry about searching for something that might cross a buffer boundary. Being able to keep the text in memory as a simple string makes it very easy to take advantage of the built-in string operations available as methods of the string object.
File-based transformations deserve special treatment, because there can be substantial overhead related to I/O performance and the amount of data that must actually be stored in memory. When working with data stored on disk, we often want to avoid loading entire files into memory, due to the size of the data: loading an 80 MB file into memory should not be done too casually! When our application needs only part of the data at a time, working on smaller segments of the data can yield substantial performance improvements, simply because we’ve allowed enough space for our program to run. If we are careful about buffer management, we can still maintain the performance advantage of using a small number of relatively large disk read and write operations by working on large chunks of data at a time. File-related recipes are found in Chapter 2.
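Working on large chunks rather than on the whole file can be sketched as follows (Python 3). The helper name `count_byte` and the chunk size are assumptions for this example, and an in-memory `io.BytesIO` object stands in for a real file on disk.

```python
import io

def count_byte(stream, target: bytes, chunk_size: int = 64 * 1024) -> int:
    """Count occurrences of a single byte, reading one chunk at a time."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:             # an empty read means end of data
            break
        total += chunk.count(target)
    return total

# Stand-in for a large file opened in binary mode.
sample = io.BytesIO(b"one\ntwo\nthree\n" * 1000)
newlines = count_byte(sample, b"\n", chunk_size=4096)
```

Counting a single byte sidesteps the buffer-boundary problem; searching for a longer pattern would need extra care, because a match could straddle two chunks.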
Another interesting source for textual data comes to light when we consider the network. Text is often retrieved from the network using a socket. While we can always view a socket as a file (using the makefile method of the socket object), the data that is retrieved over a socket may come in chunks, or we may have to wait for more data to arrive. The textual data may not consist of all data until the end of the data stream, so a file object created with makefile may not be entirely appropriate to pass to text-processing code. When working with text from a network connection, we often need to read the data from the connection before passing it along for further processing. If the data is large, it can be handled by saving it to a file as it arrives and then using that file when performing text-processing operations. More elaborate solutions can be built when the text processing needs to be started before all the data is available. Examples of parsers that are useful in such situations may be found in the htmllib and HTMLParser modules in the standard library.
The main tool Python gives us to process text is strings—immutable sequences of characters. There are actually two kinds of strings: plain strings, which contain 8-bit (ASCII) characters; and Unicode strings, which contain Unicode characters. We won’t deal much with Unicode strings here: their functionality is similar to that of plain strings, except each character takes up 2 (or 4) bytes, so that the number of different characters is in the tens of thousands (or even billions), as opposed to the 256 different characters that make up plain strings. Unicode strings are important if you must deal with text in many different alphabets, particularly Asian ideographs. Plain strings are sufficient to deal with English or any of a limited set of non-Asian languages. For example, all western European alphabets can be encoded in plain strings, typically using the international standard encoding known as ISO-8859-1 (or ISO-8859-15, if you need the Euro currency symbol as well).
In Python, you express a literal string (curiously more often known as a string literal) as:
'this is a literal string'
"this is another string"
String values can be enclosed in either single or double quotes. The two different kinds of quotes work the same way, but having both allows you to include one kind of quotes inside of a string specified with the other kind of quotes, without needing to escape them with the backslash character:
'isn\'t that grand'
"isn't that grand"
To have a string literal span multiple lines, you can use a backslash as the last character on the line, which indicates that the next line is a continuation:
big = "This is a long string\
that spans two lines."
You must embed newlines in the string if you want the string to output on two lines:
big = "This is a long string\n\
that prints on two lines."
Another approach is to enclose the string in a pair of matching triple quotes (either single or double):
bigger = """
This is an even
bigger string that
spans three lines.
"""
Using triple quotes, you don’t need to use the continuation character, and line breaks in the string literal are preserved as newline characters in the resulting Python string object. You can also make a string literal a “raw” string by preceding it with an r or R:
big = r"This is a long string\n\
with a backslash and a newline in it"
With a raw string, backslash escape sequences are left alone, rather than being interpreted. Finally, you can precede a string literal with a u or U to make it a Unicode string:
hello = u'Hello\u0020World'
Strings are immutable, which means that no matter what operation you do on a string, you will always produce a new string object, rather than mutating the existing string. A string is a sequence of characters, which means that you can access a single character by indexing:
mystr = "my string"
mystr[0]        # 'm'
mystr[-2]       # 'n'
You can also access a portion of the string with a slice:
mystr[1:4]      # 'y s'
mystr[3:]       # 'string'
mystr[-3:]      # 'ing'
Slices can be extended, that is, include a third parameter that is known as the stride or step of the slice:
mystr[:3:-1]    # 'gnirt'
mystr[1::2]     # 'ysrn'
You can loop on a string’s characters:
for c in mystr:
This binds c to each of the characters in mystr in turn. You can form another sequence:
list(mystr) # returns ['m','y',' ','s','t','r','i','n','g']
You can concatenate strings by addition:
mystr+'oid' # 'my stringoid'
You can also repeat strings by multiplication:
'xo'*3 # 'xoxoxo'
In general, you can do anything to a string that you can do to any other sequence, as long as it doesn’t require changing the sequence, since strings are immutable.
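A short sketch (Python 3 syntax) of what immutability means in practice: an attempt to change a character in place fails, while building a new string from slices works. The variable names are chosen only for this example.

```python
mystr = "my string"

try:
    mystr[0] = "M"            # strings do not support item assignment
    mutated = True
except TypeError:
    mutated = False

# Build a new string instead; the original object is left untouched.
capitalized = "M" + mystr[1:]
```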
String objects have many useful methods. For example, you can test a string’s contents with s.isdigit( ), which returns True if s is not empty and all of the characters in s are digits (otherwise, it returns False). You can produce a new modified string with a method call such as s.upper( ), which returns a new string that is like s, but with every letter changed into its uppercase equivalent. You can search for a string inside another with haystack.count('needle'), which returns the number of times the substring 'needle' appears in the string haystack. When you have a large string that spans multiple lines, you can split it into a list of single-line strings with splitlines:
list_of_lines = one_large_string.splitlines( )
You can produce the single large string again with join:
one_large_string = '\n'.join(list_of_lines)
The recipes in this chapter show off many methods of the string object. You can find complete documentation in Python’s Library Reference and Python in a Nutshell.
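The methods mentioned above can be tried together in a few lines. This is a small sketch in Python 3 syntax; the sample text is invented for the example.

```python
one_large_string = "first line\nsecond line\nthird line"

list_of_lines = one_large_string.splitlines()   # split at line boundaries
rebuilt = "\n".join(list_of_lines)              # put the newlines back

shouting = list_of_lines[0].upper()             # a new, uppercased string
occurrences = one_large_string.count("line")    # non-overlapping matches
all_digits = "2004".isdigit()                   # True only for non-empty digit strings
```

Note that splitlines drops the line terminators, so joining with a newline restores this sample exactly only because it has no trailing newline.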
Strings in Python can also be manipulated with regular expressions, via the re module. Regular expressions are a powerful (but complicated) set of tools that you may already be familiar with from another language (such as Perl), or from the use of tools such as the vi editor and text-mode commands such as grep. You’ll find a number of uses of regular expressions in recipes in the second half of this chapter. For complete documentation, see the Library Reference and Python in a Nutshell. J.E.F. Friedl, Mastering Regular Expressions (O’Reilly) is also recommended if you need to master this subject; Python’s regular expressions are basically the same as Perl’s, which Friedl covers thoroughly.
Python’s standard module string offers much of the same functionality that is available from string methods, packaged up as functions instead of methods. The string module also offers a few additional functions, such as the useful string.maketrans function that is demonstrated in a few recipes in this chapter; several helpful string constants (string.digits, for example, is '0123456789'); and, in Python 2.4, the new class Template, for simple yet flexible formatting of strings with embedded variables, which as you’ll see features in one of this chapter’s recipes. The string-formatting operator, %, provides a handy way to put strings together and to obtain precisely formatted strings from such objects as floating-point numbers. Again, you’ll find recipes in this chapter that show how to use % for your purposes. Python also has lots of standard and extension modules that perform special processing on strings of many kinds. This chapter doesn’t cover such specialized resources, but Chapter 12 is, for example, entirely devoted to the important specialized subject of processing XML.
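The facilities just listed can be sketched briefly. The example below is written for Python 3, where the translation-table builder is the static method `str.maketrans` rather than `string.maketrans`; the sample names and values are invented.

```python
import string

# string.Template substitutes $-prefixed placeholders by name.
message = string.Template("$name has $count messages").substitute(
    name="Ada", count=3)

# The % operator formats values according to format specifiers.
line = "%s scored %.2f" % ("Ada", 3.14159)

# A translation table maps each character of the first string
# to the character at the same position in the second.
table = str.maketrans("abc", "xyz")
swapped = "aabbcc".translate(table)
```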
You want to process a string one character at a time.
You can build a list whose items are the string’s characters (meaning that the items are strings, each of length one; Python doesn’t have a special type for “characters” as distinct from strings). Just call the built-in list, with the string as its argument:
thelist = list(thestring)
You may not even need to build the list, since you can loop directly on the string with a for statement:
for c in thestring:
    do_something_with(c)
or in the for clause of a list comprehension:
results = [do_something_with(c) for c in thestring]
or, with exactly the same effects as this list comprehension, you can call a function on each character with the map built-in function:
results = map(do_something_with, thestring)
In Python, characters are just strings of length one. You can loop over a string to access each of its characters, one by one. You can use map for much the same purpose, as long as what you need to do with each character is call a function on it. Finally, you can call the built-in type list to obtain a list of the length-one substrings of the string (i.e., the string’s characters). If what you want is a set whose elements are the string’s characters, you can call sets.Set with the string as the argument (in Python 2.4, you can also call the built-in set in just the same way):
import sets
magic_chars = sets.Set('abracadabra')
poppins_chars = sets.Set('supercalifragilisticexpialidocious')
print ''.join(magic_chars & poppins_chars)   # set intersection
acrd
The Library Reference section on sequences; Perl Cookbook Recipe 1.5.
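For readers on current interpreters, here is a sketch of the same ideas in Python 3, using the built-in set. Sorting the intersection is an addition made here so that the result is reproducible, since sets have no defined order.

```python
thestring = "abracadabra"

# Loop over the characters with a list comprehension.
upper_chars = [c.upper() for c in thestring]

# Characters common to both words, via set intersection.
magic_chars = set("abracadabra")
poppins_chars = set("supercalifragilisticexpialidocious")
common = "".join(sorted(magic_chars & poppins_chars))
```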
Credit: Luther Blissett
That’s what the built-in functions ord and chr are for:
>>> print ord('a')
97
>>> print chr(97)
a
The built-in function ord also accepts as its argument a Unicode string of length one, in which case it returns a Unicode code value, up to 65536. To make a Unicode string of length one from a numeric Unicode code value, use the built-in function unichr:
>>> print ord(u'\u2020')
8224
>>> print repr(unichr(8224))
u'\u2020'
It’s a mundane task, to be sure, but it is sometimes useful to turn a character (which in Python just means a string of length one) into its ASCII or Unicode code, and vice versa. The built-in functions ord, chr, and unichr cover all the related needs. Note, in particular, the huge difference between chr(n) and str(n), which beginners sometimes confuse:
>>> print repr(chr(97))
'a'
>>> print repr(str(97))
'97'
chr takes as its argument a small integer and returns the corresponding single-character string according to ASCII, while str, called with any integer, returns the string that is the decimal representation of that integer.
To turn a string into a list of character value codes, use the built-in functions map and ord together, as follows:
>>> print map(ord, 'ciao')
[99, 105, 97, 111]
To build a string from a list of character codes, use ''.join, map and chr; for example:
>>> print ''.join(map(chr, range(97, 100)))
abc
Documentation for the built-in functions chr, ord, and unichr in the Library Reference and Python in a Nutshell.
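The same conversions look slightly different in Python 3, where chr covers the whole Unicode range and unichr no longer exists. A brief sketch under that assumption:

```python
# Characters to code values.
codes = [ord(c) for c in "ciao"]

# Code values back to a string.
word = "".join(chr(n) for n in range(97, 100))

# chr also handles code values beyond ASCII, such as the dagger sign.
dagger = chr(8224)
```

The distinction between chr(n) and str(n) is unchanged: the first yields a one-character string, the second the decimal digits of n.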
You need to test if an object, typically an argument to a function or method you’re writing, is a string (or more precisely, whether the object is string-like).
A simple and fast way to check whether something is a string or Unicode object is to use the built-ins isinstance and basestring, as follows:
def isAString(anobj):
    return isinstance(anobj, basestring)
The first approach to solving this recipe’s problem that comes to many programmers’ minds is type-testing:
def isExactlyAString(anobj):
    return type(anobj) is type('')
However, this approach is pretty bad, as it willfully destroys one of Python’s greatest strengths: smooth, signature-based polymorphism. This kind of test would reject Unicode objects, instances of user-coded subclasses of str, and instances of any user-coded type that is meant to be “string-like”.
Using the isinstance built-in function, as recommended in this recipe’s Solution, is much better. The built-in type basestring exists exactly to enable this approach. basestring is a common base class for the str and unicode types, and any string-like type that user code might define should also subclass basestring, just to make sure that such isinstance testing works as intended. basestring is essentially an “empty” type, just like object, so no cost is involved in subclassing it.
Unfortunately, the canonical isinstance checking fails to accept such clearly string-like objects as instances of the UserString class from Python Standard Library module UserString, since that class, alas, does not inherit from basestring. If you need to support such types, you can check directly whether an object behaves like a string, for example:
def isStringLike(anobj):
    try: anobj + ''
    except: return False
    else: return True
This isStringLike function is slower and more complicated than the isAString function presented in the “Solution”, but it does accept instances of UserString (and other string-like types) as well as instances of str and unicode.
The general Python approach to type-checking is known as duck typing: if it walks like a duck and quacks like a duck, it’s duck-like enough for our purposes. The isStringLike function in this recipe goes only as far as the quacks-like part, but that may be enough. If and when you need to check for more string-like features of the object anobj, it’s easy to test a few more properties by using a richer expression in the try clause, for example, changing the clause to:
try: anobj.lower( ) + anobj + ''
In my experience, however, the simple test shown in the isStringLike function usually does what I need.
The most Pythonic approach to type validation (or any validation task, really) is just to try to perform whatever task you need to do, detecting and handling any errors or exceptions that might result if the situation is somehow invalid, an approach known as “it’s easier to ask forgiveness than permission” (EAFP). try/except is the key tool in enabling the EAFP style. Sometimes, as in this recipe, you may choose some simple task, such as concatenation to the empty string, as a stand-in for a much richer set of properties (such as all the wealth of operations and methods that string objects make available).
Documentation for the built-ins isinstance and basestring in the Library Reference and Python in a Nutshell.
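The contrast between the two checks can be reproduced in Python 3, with some adjustments that are assumptions of this sketch rather than part of the recipe: basestring no longer exists, so str is used instead; UserString is imported from the collections module; and the except clause is narrowed to TypeError. The snake_case function names are also chosen here.

```python
from collections import UserString

def is_a_string(obj):
    """Type-based check: true for str and its subclasses only."""
    return isinstance(obj, str)

def is_string_like(obj):
    """Behavior-based check: can obj be concatenated with a string?"""
    try:
        obj + ""
    except TypeError:
        return False
    return True

wrapped = UserString("hello")     # string-like, but not a str subclass
```

With these definitions, `wrapped` fails the type-based check but passes the behavior-based one, while an integer fails both.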
You want to align strings: left, right, or center.
That’s what the ljust, rjust, and center methods of string objects are for. Each takes a single argument, the width of the string you want as a result, and returns a copy of the starting string with spaces added on either or both sides:
>>> print '|', 'hej'.ljust(20), '|', 'hej'.rjust(20), '|', 'hej'.center(20), '|'
| hej                  |                  hej |         hej          |
Centering, left-justifying, or right-justifying text comes up surprisingly often, for example, when you want to print a simple report with centered page numbers in a monospaced font. Because of this, Python string objects supply this functionality through three of their many methods. In Python 2.3, the padding character is always a space. In Python 2.4, however, while space-padding is still the default, you may optionally call any of these methods with a second argument, a single character to be used for the padding:
>>> print 'hej'.center(20, '+')
++++++++hej+++++++++
The Library Reference section on string methods; Java Cookbook recipe 3.5.
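A compact sketch of the three methods, in Python 3 syntax. The widths are arbitrary; an odd width is used for centering so that the padding divides evenly on both sides of the three-letter word.

```python
word = "hej"

left = word.ljust(10)             # pad on the right with spaces
right = word.rjust(10)            # pad on the left with spaces
middle = word.center(9, "+")      # pad both sides with a chosen character

# A string already at least as wide as requested is returned unchanged.
unchanged = word.ljust(2)
```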
Credit: Luther Blissett
You need to work on a string without regard for any extra leading or trailing spaces a user may have typed.
That’s what the lstrip, rstrip, and strip methods of string objects are for. Each takes no argument and returns a copy of the starting string, shorn of whitespace on either or both sides:
>>> x = ' hej '
>>> print '|', x.lstrip( ), '|', x.rstrip( ), '|', x.strip( ), '|'
| hej  |  hej | hej |
Just as you may need to add space to either end of a string to align that string left, right, or center in a field of fixed width (as covered previously in Recipe 1.4), so may you need to remove all whitespace (blanks, tabs, newlines, etc.) from either or both ends. Because this need is frequent, Python string objects supply this functionality through three of their many methods. Optionally, you may call each of these methods with an argument, a string composed of all the characters you want to trim from either or both ends instead of trimming whitespace characters:
>>> x = 'xyxxyy hejyx yyx'
>>> print '|'+x.strip('xy')+'|'
| hejyx |
Note that in these cases the leading and trailing spaces have been left in the resulting string, as have the 'yx' that are followed by spaces: only the occurrences of 'x' and 'y' at either end of the string have been removed from the resulting string.
The Library Reference section on string methods; Recipe 1.4; Java Cookbook recipe 3.12.
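The behavior described above can be checked with a short sketch in Python 3 syntax. The sample input with mixed whitespace is invented for the example.

```python
raw = "  \t hej \n"

no_lead = raw.lstrip()            # whitespace removed from the left only
no_trail = raw.rstrip()           # whitespace removed from the right only
clean = raw.strip()               # whitespace removed from both ends

# With an argument, strip removes any of the listed characters from
# both ends and stops at the first character not in the list.
trimmed = "xyxxyy hejyx yyx".strip("xy")
```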
Credit: Luther Blissett
You have several small strings that you need to combine into one larger string.
To join a sequence of small strings into one large string, use the string method join. Say that pieces is a list whose items are strings, and you want one big string with all the items concatenated in order; then, you should code:
largeString = ''.join(pieces)
To put together pieces stored in a few variables, the string-formatting operator % can often be even handier:
largeString = '%s%s something %s yet more' % (small1, small2, small3)
In Python, the + operator concatenates strings and therefore offers seemingly obvious solutions for putting small strings together into a larger one. For example, when you have pieces stored in a few variables, it seems quite natural to code something like:
largeString = small1 + small2 + ' something ' + small3 + ' yet more'
And similarly, when you have a sequence of small strings named pieces, it seems quite natural to code something like:
largeString = ''
for piece in pieces:
    largeString += piece
Or, equivalently, but more fancifully and compactly:
import operator
largeString = reduce(operator.add, pieces, '')
However, it’s very important to realize that none of these seemingly obvious solutions is good: the approaches shown in the “Solution” are vastly superior.
In Python, string objects are immutable. Therefore, any operation on a string, including string concatenation, produces a new string object, rather than modifying an existing one. Concatenating N strings thus involves building and then immediately throwing away each of N-1 intermediate results. Performance is therefore vastly better for operations that build no intermediate results, but rather produce the desired end result at once.
Python’s string-formatting operator % is one such operation, particularly suitable when you have a few pieces (e.g., each bound to a different variable) that you want to put together, perhaps with some constant text in addition. Performance is not a major issue for this specific kind of task. However, the % operator also has other potential advantages, when compared to an expression that uses multiple + operations on strings. % is more readable, once you get used to it. Also, you don’t have to call str on pieces that aren’t already strings (e.g., numbers), because the format specifier %s does so implicitly. Another advantage is that you can use format specifiers other than %s, so that, for example, you can control how many significant digits the string form of a floating-point number should display.
When you have many small string pieces in a sequence, performance can become a truly important issue. The time needed to execute a loop using + or += (or a fancier but equivalent approach using the built-in function reduce) grows with the square of the number of characters you are accumulating, since the time to allocate and fill a large string is roughly proportional to the length of that string. Fortunately, Python offers an excellent alternative. The join method of a string object s takes as its only argument a sequence of strings and produces a string result obtained by concatenating all items in the sequence, with a copy of s joining each item to its neighbors. For example, ''.join(pieces) concatenates all the items of pieces in a single gulp, without interposing anything between them, and ', '.join(pieces) concatenates the items putting a comma and a space between each pair of them. It’s the fastest, neatest, and most elegant and readable way to put a large string together.
When the pieces are not all available at the same time, but rather come in sequentially from input or computation, use a list as an intermediate data structure to hold the pieces (to add items at the end of a list, you can call the append or extend methods of the list). At the end, when the list of pieces is complete, call ''.join(thelist) to obtain the big string that’s the concatenation of all pieces. Of all the many handy tips and tricks I could give you about Python strings, I consider this one by far the most significant: the most frequent reason some Python programs are too slow is that they build up big strings with + or +=. So, train yourself never to do that. Use, instead, the ''.join approach recommended in this recipe.
Python 2.4 makes a heroic attempt to ameliorate the issue, reducing a little the performance penalty due to such erroneous use of +=. While ''.join is still way faster and in all ways preferable, at least some newbie or careless programmer gets to waste somewhat fewer machine cycles. Similarly, psyco (a specializing just-in-time [JIT] Python compiler found at http://psyco.sourceforge.net/), can reduce the += penalty even further. Nevertheless, ''.join remains the best approach in all cases.
The Library Reference and Python in a Nutshell sections on string methods, string-formatting operations, and the operator module.
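The accumulate-then-join pattern described in this recipe can be sketched as follows (Python 3 syntax). The function name `render_pairs` and the sample data are invented for the example.

```python
def render_pairs(pairs):
    """Build one string from many small pieces without repeated +=."""
    pieces = []
    for name, value in pairs:
        # %s calls str() implicitly; %d formats the integer.
        pieces.append("%s=%d" % (name, value))
    # A single join at the end produces the result in one step.
    return "; ".join(pieces)

report = render_pairs([("alpha", 1), ("beta", 20)])
plain = "".join(["ab", "cd", "ef"])
```

The list grows by cheap appends, and only the final join allocates the large string.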
You want to reverse the characters or words in a string.
Strings are immutable, so, to reverse one, we need to make a copy. The simplest approach for reversing is to take an extended slice with a “step” of -1, so that the slicing proceeds backwards:
revchars = astring[::-1]
To flip words, we need to make a list of words, reverse it, and join it back into a string with a space as the joiner:
revwords = astring.split( )       # string -> list of words
revwords.reverse( )               # reverse the list in place
revwords = ' '.join(revwords)     # list of strings -> string
or, if you prefer terse and compact “one-liners”:
revwords = ' '.join(astring.split( )[::-1])
If you need to reverse by words while preserving untouched the intermediate whitespace, you can split by a regular expression:
import re
revwords = re.split(r'(\s+)', astring)   # separators too, since '(...)'
revwords.reverse( )                      # reverse the list in place
revwords = ''.join(revwords)             # list of strings -> string
Note that the joiner must be the empty string in this case, because the whitespace separators are kept in the revwords list (by using re.split with a regular expression that includes a parenthesized group). Again, you could make a one-liner, if you wished:
revwords = ''.join(re.split(r'(\s+)', astring)[::-1])
but this is getting too dense and unreadable to be good Python code!
In Python 2.4, you may make the by-word one-liners more readable by using the new built-in function reversed instead of the less readable extended-slicing indicator [::-1]:
revwords = ' '.join(reversed(astring.split( )))
revwords = ''.join(reversed(re.split(r'(\s+)', astring)))
For the by-character case, though, astring[::-1] remains best, even in 2.4, because to use reversed, you’d have to introduce a call to ''.join as well:
revchars = ''.join(reversed(astring))
The new reversed built-in returns an iterator, suitable for looping on or for passing to some “accumulator” callable such as ''.join; it does not return a ready-made string!
Library Reference and Python in a Nutshell docs on sequence types and slicing, and (2.4 only) the reversed built-in; Perl Cookbook recipe 1.6.
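The three kinds of reversal in this recipe can be compared side by side. The sketch below uses Python 3 syntax, and the sample string, with a double space and a tab between words, is chosen to show what happens to the separators.

```python
import re

astring = "one  two\tthree"

# Reverse character by character.
revchars = astring[::-1]

# Reverse word by word; the original whitespace becomes single spaces.
revwords = " ".join(reversed(astring.split()))

# Reverse word by word, keeping each whitespace separator as it was.
# The parenthesized group makes re.split return the separators too.
revwords_ws = "".join(reversed(re.split(r"(\s+)", astring)))
```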
Credit: Jürgen Hermann, Horst Hansen
You need to check for the occurrence of any of a set of characters in a string.
The simplest approach is clear, fast, and general (it works for any sequence, not just strings, and for any container on which you can test for membership, not just sets):
def containsAny(seq, aset):
    """ Check whether sequence seq contains ANY of the items in aset. """
    for c in seq:
        if c in aset:
            return True
    return False
You can gain a little speed by moving to a higher-level, more sophisticated approach, based on the itertools standard library module, essentially expressing the same approach in a different way:
import itertools
def containsAny(seq, aset):
    for item in itertools.ifilter(aset.__contains__, seq):
        return True
    return False
Most problems related to sets are best handled by using the set built-in type introduced in Python 2.4 (if you’re using Python 2.3, you can use the equivalent sets.Set type from the Python Standard Library). However, there are exceptions. Here, for example, a pure set-based approach would be something like:
def containsAny(seq, aset):
    return bool(set(aset).intersection(seq))
However, with this approach, every item in seq inevitably has to be examined. The functions in this recipe’s Solution, on the other hand, “short-circuit”: they return as soon as they know the answer. They must still check every item in seq when the answer is False (we could never affirm that no item in seq is a member of aset without examining all the items, of course). But when the answer is True, we often learn about that very soon, namely as soon as we examine one item that is a member of aset. Whether this matters at all is very data-dependent, of course. It will make no practical difference when seq is short, or when the answer is typically False, but it may be extremely important for a very long seq (when the answer can typically be soon determined to be True).
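The difference between short-circuiting and examining everything can be made visible by counting how many items each approach actually consumes. The sketch below is only an illustration: the counted helper and the sample data are hypothetical names introduced here, and the built-in set assumes Python 2.4 or later.

```python
def containsAny(seq, aset):
    for c in seq:
        if c in aset:
            return True
    return False

def counted(seq, counter):
    # hypothetical helper: yield seq's items one by one,
    # recording in counter[0] how many have been consumed
    for item in seq:
        counter[0] += 1
        yield item

data = 'x' + 'a' * 1000          # the very first item is a member
wanted = set('xyz')

loop_count = [0]
loop_answer = containsAny(counted(data, loop_count), wanted)

set_count = [0]
set_answer = bool(wanted.intersection(counted(data, set_count)))
# both answers are True, but loop_count[0] is 1 while set_count[0] is 1001
```

With data arranged the other way around (the only member at the very end), both approaches would consume all 1,001 items, which is the data dependence the text describes.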
The first version of containsAny presented in the recipe has the advantage of simplicity and clarity: it expresses the fundamental idea with total transparency. The second version may appear to be “clever”, and that is not a complimentary adjective in the Python world, where simplicity and clarity are core values. However, the second version is well worth considering, because it shows a higher-level approach, based on the itertools module of the standard library. Higher-level approaches are most often preferable to lower-level ones (although the issue is moot in this particular case). itertools.ifilter takes a predicate and an iterable, and yields the items in that iterable that satisfy the “predicate”. Here, as the “predicate”, we use aset.__contains__, the bound method that is internally called when we code an expression such as c in aset for membership testing. So, if ifilter yields anything at all, it yields an item of seq that is also a member of aset, so we can return True as soon as this happens. If we get to the statement following the for, it must mean the return True never executed, because no items of seq are members of aset, so we can return False.
If your application needs some function such as containsAny to check whether a string (or other sequence) contains any members of a set, you may also need such variants as:
def containsOnly(seq, aset):
    """ Check whether sequence seq contains ONLY items in aset. """
    for c in seq:
        if c not in aset:
            return False
    return True
containsOnly is the same function as containsAny, but with the logic turned upside-down. Other apparently similar tasks don’t lend themselves to short-circuiting (they intrinsically need to examine all items) and so are best tackled by using the built-in type set (in Python 2.4; in 2.3, you can use sets.Set in the same way):
def containsAll(seq, aset):
    """ Check whether sequence seq contains ALL the items in aset. """
    return not set(aset).difference(seq)
If you’re not accustomed to using the set (or sets.Set) method difference, be aware of its semantics: for any set a, a.difference(b) (just like a-set(b)) returns the set of all elements of a that are not in b. For example:
>>> L1 = [1, 2, 3, 3]
>>> L2 = [1, 2, 3, 4]
>>> set(L1).difference(L2)
set([ ])
>>> set(L2).difference(L1)
set([4])
which hopefully helps explain why:
>>> containsAll(L1, L2)
False
>>> containsAll(L2, L1)
True
(In other words, don’t confuse difference with another method of set, symmetric_difference, which returns the set of all items that are in either argument and not in the other.)
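To make the contrast between the two methods concrete, here is a small sketch with two sample sets chosen for illustration (they are not taken from the recipe). It assumes the built-in set of Python 2.4 or later; sets.Set offers the same two methods.

```python
a = set([1, 2, 3])
b = set([3, 4])

only_in_a = a.difference(b)               # items of a that are not in b
only_in_b = b.difference(a)               # items of b that are not in a
in_just_one = a.symmetric_difference(b)   # items in exactly one of the two
```

Here only_in_a holds 1 and 2, only_in_b holds just 4, and in_just_one holds 1, 2, and 4: difference depends on which set you call it on, while symmetric_difference does not.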
When you’re dealing specifically with (plain, not Unicode) strings for both seq and aset, you may not need the full generality of the functions presented in this recipe, and may want to try the more specialized approach explained in Recipe 1.10 based on strings’ method translate and the string.maketrans function from the Python Standard Library. For example:
import string
notrans = string.maketrans('', '')           # identity "translation"
def containsAny(astr, strset):
    return len(strset) != len(strset.translate(notrans, astr))
def containsAll(astr, strset):
    return not strset.translate(notrans, astr)
This somewhat tricky approach relies on strset.translate(notrans, astr) being the subsequence of strset that is made of characters not in astr. When that subsequence has the same length as strset, no characters have been removed by strset.translate, therefore no characters of strset are in astr. Conversely, when the subsequence is empty, all characters have been removed, so all characters of strset are in astr. The translate method keeps coming up naturally when one wants to treat strings as sets of characters, because it’s speedy as well as handy and flexible; see Recipe 1.10 for more details.
These two sets of approaches to the recipe’s tasks have very different levels of generality. The earlier approaches are very general: not at all limited to string processing, they make rather minimal demands on the objects you apply them to. The approach based on the translate method, on the other hand, works only when both astr and strset are strings, or very closely mimic plain strings’ functionality. Not even Unicode strings suffice, because the translate method of Unicode strings has a signature that is different from that of plain strings: a single argument (a dict mapping code numbers to Unicode strings or None) instead of two (both strings).
Recipe 1.10; documentation for the translate method of strings and Unicode objects, and maketrans function in the string module, in the Library Reference and Python in a Nutshell; ditto for documentation of built-in set (Python 2.4 only), modules sets and itertools, and the special method __contains__.
Credit: Chris Perkins, Raymond Hettinger
You often want to use the fast code in strings’ translate method, but find it hard to remember in detail how that method and the function string.maketrans work, so you want a handy facade to simplify their use in typical cases.
The translate method of strings is quite powerful and flexible, as detailed in Recipe 1.10. However, exactly because of that power and flexibility, it may be a nice idea to front it with a “facade” that simplifies its typical use. A little factory function, returning a closure, can do wonders for this kind of task:
import string
def translator(frm='', to='', delete='', keep=None):
    if len(to) == 1:
        to = to * len(frm)
    trans = string.maketrans(frm, to)
    if keep is not None:
        allchars = string.maketrans('', '')
        delete = allchars.translate(allchars, keep.translate(allchars, delete))
    def translate(s):
        return s.translate(trans, delete)
    return translate
I often find myself wanting to use strings’ translate method for any one of a few purposes, but each time I have to stop and think about the details (see Recipe 1.10 for more information about those details). So, I wrote myself a class (later remade into the factory closure presented in this recipe’s Solution) to encapsulate various possibilities behind a simpler-to-use facade. Now, when I want a function that keeps only characters from a given set, I can easily build and use that function:
>>> digits_only = translator(keep=string.digits)
>>> digits_only('Chris Perkins : 224-7992')
'2247992'
It’s similarly simple when I want to remove a set of characters:
>>> no_digits = translator(delete=string.digits)
>>> no_digits('Chris Perkins : 224-7992')
'Chris Perkins : -'
and when I want to replace a set of characters with a single character:
>>> digits_to_hash = translator(frm=string.digits, to='#')
>>> digits_to_hash('Chris Perkins : 224-7992')
'Chris Perkins : ###-####'
While the latter may appear to be a bit of a special case, it is a task that keeps coming up for me every once in a while.
I had to make one arbitrary design decision in this recipe: namely, I decided that the delete parameter “trumps” the keep parameter if they overlap:
>>> trans = translator(delete='abcd', keep='cdef')
>>> trans('abcdefg')
'ef'
For your applications it might be preferable to ignore delete if keep is specified, or, perhaps better, to raise an exception if they are both specified, since it may not make much sense to let them both be given in the same call to translator, anyway. Also: as noted in Recipe 1.8 and Recipe 1.10, the code in this recipe works only for normal strings, not for Unicode strings. See Recipe 1.10 to learn how to code this kind of functionality for Unicode strings, whose translate method is different from that of plain (i.e., byte) strings.
Recipe 1.10 for a direct equivalent of this recipe’s translator(keep=...), more information on the translate method, and an equivalent approach for Unicode strings; documentation for strings’ translate method, and for the maketrans function in the string module, in the Library Reference and Python in a Nutshell.
Credit: Jürgen Hermann, Nick Perkins, Peter Cogolo
Given a set of characters to keep, you need to build a filtering function that, applied to any string s, returns a copy of s that contains only characters in the set.
The translate method of string objects is fast and handy for all tasks of this ilk. However, to call translate effectively to solve this recipe’s task, we must do some advance preparation. The first argument to translate is a translation table: in this recipe, we do not want to do any translation, so we must prepare a first argument that specifies “no translation”. The second argument to translate specifies which characters we want to delete: since the task here says that we’re given, instead, a set of characters to keep (i.e., to not delete), we must prepare a second argument that gives the set complement, deleting all characters we must not keep. A closure is the best way to do this advance preparation just once, obtaining a fast filtering function tailored to our exact needs:
import string
# Make a reusable string of all characters, which does double duty
# as a translation table specifying "no translation whatsoever"
allchars = string.maketrans('', '')
def makefilter(keep):
    """ Return a function that takes a string and returns a partial copy
        of that string consisting of only the characters in 'keep'.
        Note that `keep' must be a plain string.
    """
    # Make a string of all characters that are not in 'keep': the "set
    # complement" of keep, meaning the string of characters we must delete
    delchars = allchars.translate(allchars, keep)
    # Make and return the desired filtering function (as a closure)
    def thefilter(s):
        return s.translate(allchars, delchars)
    return thefilter
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels('four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels('tiger, tiger burning bright')
    # emits: ieieuii
The key to understanding this recipe lies in the definitions of the maketrans function in the string module of the Python Standard Library and in the translate method of string objects. translate returns a copy of the string you call it on, replacing each character in it with the corresponding character in the translation table passed in as the first argument and deleting the characters specified in the second argument. maketrans is a utility function to create translation tables. (A translation table is a string t of exactly 256 characters: when you pass t as the first argument of a translate method, each character c of the string on which you call the method is translated in the resulting string into the character t[ord(c)].)
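The table lookup just described can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not how translate is implemented: model_translate and the two sample tables are names invented for this illustration, and the model assumes every character has a code below 256.

```python
def model_translate(s, table, deletechars=''):
    # pure-Python model of plain strings' translate(table, deletechars):
    # drop the characters to delete, map every other c to table[ord(c)]
    return ''.join([table[ord(c)] for c in s if c not in deletechars])

# identity table: the character with code i sits at index i
identity = ''.join(map(chr, range(256)))

# a table mapping lowercase ASCII vowels to uppercase, all else to itself
table_chars = list(identity)
for vowel in 'aeiou':
    table_chars[ord(vowel)] = vowel.upper()
vowels_up = ''.join(table_chars)

kept = model_translate('four score', identity, ' ')   # 'fourscore'
shouted = model_translate('four score', vowels_up)    # 'fOUr scOrE'
```

The identity table plays the role of allchars in the Solution: with it, only the deletion argument has any effect.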
In this recipe, efficiency is maximized by splitting the filtering task into preparation and execution phases. The string of all characters is clearly reusable, so we build it once and for all as a global variable when this module is imported. That way, we ensure that each filtering function uses the same string-of-all-characters object, not wasting any memory. The string of characters to delete, which we need to pass as the second argument to the translate method, depends on the set of characters to keep, because it must be built as the “set complement” of the latter: we must tell translate to delete every character that we do not want to keep. So, we build the delete-these-characters string in the makefilter factory function. This building is done quite rapidly by using the translate method to delete the “characters to keep” from the string of all characters. The translate method is very fast, as are the construction and execution of these useful little resulting functions. The test code that executes when this recipe runs as a main script shows how to build a filtering function by calling makefilter, bind a name to the filtering function (by simply assigning the result of calling makefilter to a name), then call the filtering function on some strings and print the results.
Incidentally, calling a filtering function with allchars as the argument puts the set of characters being kept into a canonic string form, alphabetically sorted and without duplicates. You can use this idea to code a very simple function to return the canonic form of any set of characters presented as an arbitrary string:
def canonicform(s):
    """ Given a string s, return s's characters as a canonic-form string:
        alphabetized and without duplicates. """
    return makefilter(s)(allchars)
The Solution uses a def statement to make the nested function (closure) it returns, because def is the most normal, general, and clear way to make functions. If you prefer, you could use lambda instead, changing the def and return statements in function makefilter into just one return lambda statement:
return lambda s: s.translate(allchars, delchars)
Most Pythonistas, but not all, consider using def clearer and more readable than using lambda.
Since this recipe deals with strings seen as sets of characters, you could alternatively use the sets.Set type (or, in Python 2.4, the new built-in set type) to perform the same tasks. Thanks to the translate method’s power and speed, it’s often faster to work directly on strings, rather than go through sets, for tasks of this ilk. However, just as noted in Recipe 1.8, the functions in this recipe only work for normal strings, not for Unicode strings.
To solve this recipe’s task for Unicode strings, we must do some very different preparation. A Unicode string’s translate method takes only one argument: a mapping or sequence, which is indexed with the code number of each character in the string. Characters whose codes are not keys in the mapping (or indices in the sequence) are just copied over to the output string. Otherwise, the value corresponding to each character’s code must be either a Unicode string (which is substituted for the character) or None (in which case the character is deleted). A very nice and powerful arrangement, but unfortunately not one that’s identical to the way plain strings work, so we must recode.
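A tiny example may help fix the one-argument form in mind. The mapping and the sample text below are made up for illustration; the u'...' literals are used so that the same lines denote Unicode strings both in Python 2 and in later versions that accept that prefix.

```python
# keys are code numbers; values are replacement strings, or None to delete
table = {
    ord(u'a'): u'A',      # substitute one character for another
    ord(u'e'): u'<e>',    # a replacement may be longer than one character
    ord(u' '): None,      # None deletes the character
}
result = u'a bee'.translate(table)   # u'Ab<e><e>'
```

The letter b has no entry in the mapping, so it is copied to the output unchanged.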
Normally, we use either a dict or a list as the argument to a Unicode string’s translate method to translate some characters and/or delete some. But for the specific task of this recipe (i.e., keep just some characters, delete all others), we might need an inordinately large dict or list, just mapping all other characters to None. It’s better to code, instead, a little class that appropriately implements a __getitem__ method (the special method that gets called in indexing operations). Once we’re going to the (slight) trouble of coding a little class, we might as well make its instances callable and have makefilter be just a synonym for the class itself:
import sets
class Keeper(object):
    def __init__(self, keep):
        self.keep = sets.Set(map(ord, keep))
    def __getitem__(self, n):
        if n not in self.keep:
            return None
        return unichr(n)
    def __call__(self, s):
        return unicode(s).translate(self)
makefilter = Keeper
if __name__ == '__main__':
    just_vowels = makefilter('aeiouy')
    print just_vowels(u'four score and seven years ago')
    # emits: ouoeaeeyeaao
    print just_vowels(u'tiger, tiger burning bright')
    # emits: ieieuii
We might name the class itself makefilter, but, by convention, one normally names classes with an uppercase initial; there is essentially no cost in following that convention here, too, so we did.
Recipe 1.8; documentation for the translate method of strings and Unicode objects, and maketrans function in the string module, in the Library Reference and Python in a Nutshell.
Credit: Andrew Dalke
Python can use a plain string to hold either text or arbitrary bytes, and you need to determine (heuristically, of course: there can be no precise algorithm for this) which of the two cases holds for a certain string.
We can use the same heuristic criteria as Perl does, deeming a string binary if it contains any nulls or if more than 30% of its characters have the high bit set (i.e., codes greater than 127) or are strange control codes. We have to code this ourselves, but this also means we easily get to tweak the heuristics for special application needs:
from __future__ import division           # ensure / does NOT truncate
import string
text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
_null_trans = string.maketrans("", "")
def istext(s, text_characters=text_characters, threshold=0.30):
    # if s contains any null, it's not text:
    if "\0" in s:
        return False
    # an "empty" string is "text" (arbitrary but reasonable choice):
    if not s:
        return True
    # Get the substring of s made up of non-text characters
    t = s.translate(_null_trans, text_characters)
    # s is 'text' if no more than 30% of its characters are non-text ones:
    return len(t)/len(s) <= threshold
You can easily do minor customizations to the heuristics used by function istext by passing in specific values for the threshold, which defaults to 0.30 (30%), or for the string of those characters that are to be deemed “text” (which defaults to normal ASCII characters plus the four “normal” control characters, meaning ones that are often found in text). For example, if you expected Italian text encoded as ISO-8859-1, you could add the accented letters used in Italian, "àèéìòù", to the text_characters argument.
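The effect of the two tuning knobs can be tried out with a simplified stand-in for istext. The sketch below is an assumption-laden variant, not the recipe's function: it counts non-text characters with a list comprehension instead of translate, compares by multiplication rather than division, and uses the made-up names istext_sketch, TEXT_CHARACTERS, and sample.

```python
TEXT_CHARACTERS = "".join(map(chr, range(32, 127))) + "\n\r\t\b"

def istext_sketch(s, text_characters=TEXT_CHARACTERS, threshold=0.30):
    if "\0" in s:          # any null means "not text"
        return False
    if not s:              # an empty string counts as text
        return True
    nontext = len([c for c in s if c not in text_characters])
    # same test as nontext/len(s) <= threshold, without dividing
    return nontext <= threshold * len(s)

sample = "\x01\x02\x03abc"     # half of the characters are odd control codes

by_default = istext_sketch(sample)                      # False: 50% > 30%
lenient = istext_sketch(sample, threshold=0.6)          # True: 50% <= 60%
widened = istext_sketch(
    sample, text_characters=TEXT_CHARACTERS + "\x01\x02\x03")   # True
```

Raising the threshold and widening the set of "text" characters are independent adjustments; either one is enough to flip the verdict for this sample.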
Often, what you need to check as being either binary or text is not a string, but a file. Again, we can use the same heuristics as Perl, checking just the first block of the file with the istext function shown in this recipe’s Solution:
def istextfile(filename, blocksize=512, **kwds):
    return istext(open(filename).read(blocksize), **kwds)
Note that, by default, the expression len(t)/len(s) used in the body of function istext would truncate the result to 0, since it is a division between integer numbers. In some future version (probably Python 3.0, a few years away), Python will change the meaning of the / operator so that it performs division without truncation; if you really do want truncation, you should use the truncating-division operator, //.
However, Python has not yet changed the semantics of division, keeping the old one by default in order to ensure backwards compatibility. It’s important that the millions of lines of code of Python programs and modules that already exist keep running smoothly under all new 2.x versions of Python—only upon a change of major language version number, no more often than every decade or so, is Python allowed to change in ways that aren’t backwards-compatible.
Since, in the small module containing this recipe’s Solution, it’s handy for us to get the division behavior that is scheduled for introduction in some future release, we start our module with the statement:
from __future__ import division
This statement doesn’t affect the rest of the program, only the specific module that starts with this statement; throughout this module, / performs “true division” (without truncation). As of Python 2.3 and 2.4, division is the only thing you may want to import from __future__. Other features that used to be scheduled for the future, nested_scopes and generators, are now part of the language and cannot be turned off; it’s innocuous to import them, but it makes sense to do so only if your program also needs to run under some older version of Python.
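The two kinds of division can be compared side by side without depending on which meaning / has in the running module. This sketch uses the // operator and the standard operator module's truediv function; the sample numbers and variable names are arbitrary.

```python
import operator

truncated = 7 // 2                        # truncating division: 3
floored_negative = -7 // 2                # rounds toward minus infinity: -4
true_quotient = operator.truediv(7, 2)    # what / yields under true division: 3.5
```

Note that // rounds down rather than toward zero, which matters as soon as negative operands are involved.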
Recipe 1.10 for more details about function maketrans and string method translate; Language Reference for details about true versus truncating division.
You need to convert a string from uppercase to lowercase, or vice versa.
That’s what the upper and lower methods of string objects are for. Each takes no arguments and returns a copy of the string in which each letter has been changed to upper- or lowercase, respectively.
big = little.upper( )
little = big.lower( )
Characters that are not letters are copied unchanged.
s.capitalize is similar to s[:1].upper( )+s[1:].lower( ): the first character is changed to uppercase, and all others are changed to lowercase. s.title is again similar, but it capitalizes the first letter of each word (where a “word” is a sequence of letters) and uses lowercase for all other letters:
>>> print 'one tWo thrEe'.capitalize( )
One two three
>>> print 'one tWo thrEe'.title( )
One Two Three
Case manipulation of strings is a very frequent need. Because of this, several string methods let you produce case-altered copies of strings. Moreover, you can also check whether a string object is already in a given case form, with the methods isupper, islower, and istitle, which all return True if the string is not empty, contains at least one letter, and already meets the uppercase, lowercase, or titlecase constraints. There is no analogous iscapitalized method, and coding it is not trivial, if we want behavior that’s strictly similar to strings’ is... methods. Those methods all return False for an “empty” string, and the three case-checking ones also return False for strings that, while not empty, contain no letters at all.
The simplest and clearest way to code iscapitalized is:
def iscapitalized(s):
    return s == s.capitalize( )
However, this version deviates from the boundary-case semantics of the analogous is... methods, since it also returns True for strings that are empty or contain no letters. Here’s a stricter one:
import string
notrans = string.maketrans('', '')  # identity "translation"
def containsAny(astr, strset):
    return len(strset) != len(strset.translate(notrans, astr))
def iscapitalized(s):
    return s == s.capitalize( ) and containsAny(s, string.letters)
Here, we use the function shown in Recipe 1.8 to ensure we return False if s is empty or contains no letters. As noted in Recipe 1.8, this means that this specific version works only for plain strings, not for Unicode ones.
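The boundary cases are easiest to see side by side. The following sketch is a variant written for illustration only: instead of the translate-based containsAny, it uses a small hypothetical helper, has_letter, built on the isalpha string method, so that it does not depend on string.maketrans.

```python
def iscapitalized_simple(s):
    return s == s.capitalize()

def has_letter(s):
    # hypothetical helper: True if at least one character is a letter
    for c in s:
        if c.isalpha():
            return True
    return False

def iscapitalized_strict(s):
    return s == s.capitalize() and has_letter(s)

# the simple version accepts strings with no letters at all...
simple_results = [iscapitalized_simple(s) for s in ('', '123', 'Hello world')]
# ...while the strict one mirrors isupper, islower, and istitle
strict_results = [iscapitalized_strict(s) for s in ('', '123', 'Hello world')]
```

Here simple_results is [True, True, True], whereas strict_results is [False, False, True], which matches how ''.isupper( ) and '123'.isupper( ) both return False.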
Library Reference and Python in a Nutshell docs on string methods; Perl Cookbook recipe 1.9; Recipe 1.8.
Credit: Alex Martelli
You want to access portions of a string. For example, you’ve read a fixed-width record and want to extract the record’s fields.
Slicing is great, but it only does one field at a time:
afield = theline[3:8]
If you need to think in terms of field lengths, struct.unpack may be appropriate. For example:
import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
# by how many bytes does theline exceed the length implied by this
# base-format (24 bytes in this case, but struct.calcsize is general)
numremain = len(theline) - struct.calcsize(baseformat)
# complete the format with the appropriate 's' field, then unpack
format = "%s %ds" % (baseformat, numremain)
l, s1, s2, t = struct.unpack(format, theline)
If you want to skip rather than get "all the rest", then just unpack the initial part of theline with the right length:
l, s1, s2 = struct.unpack(baseformat, theline[:struct.calcsize(baseformat)])
If you need to split at five-byte boundaries, you can easily code a list comprehension (LC) of slices:
fivers = [theline[k:k+5] for k in xrange(0, len(theline), 5)]
Chopping a string into individual characters is of course easier:
chars = list(theline)
If you prefer to think of your data as being cut up at specific columns, slicing with LCs is generally handier:
cuts = [8, 14, 20, 26, 30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
The call to zip in this LC returns a list of pairs of the form (cuts[k], cuts[k+1]), except that the first pair is (0, cuts[0]), and the last one is (cuts[len(cuts)-1], None). In other words, each pair gives the right (i, j) for slicing between each cut and the next, except that the first one is for the slice before the first cut, and the last one is for the slice from the last cut to the end of the string. The rest of the LC just uses these pairs to cut up the appropriate slices of theline.
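Printing the intermediate pairs makes the slicing logic easy to follow. In this sketch the sample line is invented for illustration, and zip is wrapped in list so that the pairs can be inspected even on Python versions where zip returns an iterator.

```python
cuts = [8, 14, 20, 26, 30]
theline = 'abcdefghijklmnopqrstuvwxyz0123456789'   # 36 characters

# each pair is the (start, stop) of one slice; None means "to the end"
pairs = list(zip([0] + cuts, cuts + [None]))
# [(0, 8), (8, 14), (14, 20), (20, 26), (26, 30), (30, None)]

pieces = [theline[i:j] for i, j in pairs]
# ['abcdefgh', 'ijklmn', 'opqrst', 'uvwxyz', '0123', '456789']
```

Joining the pieces back together with ''.join gives the original line, since the slices neither overlap nor leave gaps.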
This recipe was inspired by recipe 1.1 in the Perl Cookbook. Python’s slicing takes the place of Perl’s substr. Perl’s built-in unpack and Python’s struct.unpack are similar. Perl’s is slightly richer, since it accepts a field length of * for the last field to mean all the rest. In Python, we have to compute and insert the exact length for either extraction or skipping. This isn’t a major issue because such extraction tasks will usually be encapsulated into small functions. Memoizing, also known as automatic caching, may help with performance if the function is called repeatedly, since it allows you to avoid redoing the preparation of the format for the struct unpacking. See Recipe 18.5 for details about memoizing.
In a purely Python context, the point of this recipe is to remind you that struct.unpack is often viable, and sometimes preferable, as an alternative to string slicing (not quite as often as unpack versus substr in Perl, given the lack of a *-valued field length, but often enough to be worth keeping in mind).
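A worked example of the struct.unpack approach, on a made-up record, may be useful here. The sketch uses a bytes literal (b'...'), which struct requires on later Python versions and which Python 2.6 and later also accept; the record contents and variable names are assumptions for illustration.

```python
import struct

# 5 bytes, 3 bytes to skip, two 8-byte fields, then 12 more bytes
theline = b'AAAAA' + b'xxx' + b'BBBBBBBB' + b'CCCCCCCC' + b'rest of line'

baseformat = "5s 3x 8s 8s"
basesize = struct.calcsize(baseformat)        # 24
numremain = len(theline) - basesize           # 12

# complete the format with an 's' field for all the rest, then unpack
fmt = "%s %ds" % (baseformat, numremain)      # "5s 3x 8s 8s 12s"
first, s1, s2, rest = struct.unpack(fmt, theline)

# or ignore the rest by unpacking only the leading part
head = struct.unpack(baseformat, theline[:basesize])
```

The 3x item consumes three bytes without producing a value, which is why four names suffice for a five-item format.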
Each of these snippets is, of course, best encapsulated in a function. Among other advantages, encapsulation ensures we don’t have to work out the computation of the last field’s length on each and every use. This function is the equivalent of the first snippet using struct.unpack in the “Solution”:
def fields(baseformat, theline, lastfield=False):
    # by how many bytes does theline exceed the length implied by
    # base-format (struct.calcsize computes exactly that length)
    numremain = len(theline)-struct.calcsize(baseformat)
    # complete the format with the appropriate 's' or 'x' field, then unpack
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
A design decision worth noticing (and, perhaps, worth criticizing) is that of having a lastfield=False optional parameter. This reflects the observation that, while we often want to skip the last, unknown-length subfield, sometimes we want to retain it instead. The use of lastfield in the expression lastfield and "s" or "x" (equivalent to C’s ternary operator lastfield?"s":"x") saves an if/else, but it’s unclear whether the saving is worth the obscurity. See Recipe 18.9 for more about simulating ternary operators in Python.
If function fields is called in a loop, memoizing (caching) with a key that is the tuple (baseformat, len(theline), lastfield) may offer faster performance. Here’s a version of fields with memoizing:
def fields(baseformat, theline, lastfield=False, _cache={ }):
    # build the key and try getting the cached format string
    key = baseformat, len(theline), lastfield
    format = _cache.get(key)
    if format is None:
        # no format string was cached, build and cache it
        numremain = len(theline)-struct.calcsize(baseformat)
        _cache[key] = format = "%s %d%s" % (
            baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
The idea behind this memoizing is to perform the somewhat costly preparation of format only once for each set of arguments requiring that preparation, thereafter storing it in the _cache dictionary. Of course, like all optimizations, memoizing needs to be validated by measuring performance to check that each given optimization does actually speed things up. In this case, I measure an increase in speed of approximately 30% to 40% for the memoized version, meaning that the optimization is probably not worth the bother unless the function is part of a performance bottleneck for your program.
The function equivalent of the next LC snippet in the solution is:
def split_by(theline, n, lastfield=False):
    # cut up all the needed pieces
    pieces = [theline[k:k+n] for k in xrange(0, len(theline), n)]
    # drop the last piece if too short and not required
    # (the check on pieces avoids an IndexError when theline is empty)
    if not lastfield and pieces and len(pieces[-1]) < n:
        pieces.pop( )
    return pieces
And for the last snippet:
def split_at(theline, cuts, lastfield=False):
    # cut up all the needed pieces
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
    # drop the last piece if not required
    if not lastfield:
        pieces.pop( )
    return pieces
In both of these cases, a list comprehension doing slicing turns out to be slightly preferable to the use of struct.unpack.
A completely different approach is to use generators, such as:
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]
def split_by(the_line, n, lastfield=False):
    return split_at(the_line, xrange(n, len(the_line), n), lastfield)
Generator-based approaches are particularly appropriate when all you need to do on the sequence of resulting fields is loop over it, either explicitly, or implicitly by calling on it some “accumulator” callable such as ''.join. If you do need to materialize a list of the fields, and what you have available is a generator instead, you only need to call the built-in list on the generator, as in:
list_of_fields = list(split_by(the_line, 5))
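To see the generators at work, here is a minimal sketch that restates them and runs them on made-up sample lines. The only change from the versions above is an assumption made for portability: range replaces xrange, so that the same code also runs on later Python versions.

```python
def split_at(the_line, cuts, lastfield=False):
    last = 0
    for cut in cuts:
        yield the_line[last:cut]
        last = cut
    if lastfield:
        yield the_line[last:]

def split_by(the_line, n, lastfield=False):
    # range rather than xrange: an assumption made for portability
    return split_at(the_line, range(n, len(the_line), n), lastfield)

line = 'abcdefghijkl'                               # 12 characters
full_only = list(split_by(line, 5))                 # ['abcde', 'fghij']
with_tail = list(split_by(line, 5, True))           # ['abcde', 'fghij', 'kl']
joined = '-'.join(split_at(line, [3, 7], True))     # 'abc-defg-hijkl'

# a line whose length is an exact multiple of n
exact = list(split_by('abcdefghij', 5))             # ['abcde']
exact_with_last = list(split_by('abcdefghij', 5, True))   # ['abcde', 'fghij']
```

One behavior worth checking against your own needs: as the last two lines show, when the length of the line is an exact multiple of n, this generator form produces the final full-width piece only when lastfield is true.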
Recipe 18.9 and Recipe 18.5; Perl Cookbook recipe 1.1.
Credit: Tom Good
You have a string made up of multiple lines, and you need to build another string from it, adding or removing leading spaces on each line so that the indentation of each line is some absolute number of spaces.
The methods of string objects are quite handy, and let us write a simple function to perform this task:
def reindent(s, numSpaces):
    leading_space = numSpaces * ' '
    lines = [ leading_space + line.strip( )
              for line in s.splitlines( ) ]
    return '\n'.join(lines)
When working with text, it may be necessary to change the indentation level of a block. This recipe’s code adds leading spaces to or removes them from each line of a multiline string so that the indentation level of each line matches some absolute number of spaces. For example:
>>> x = """  line one
...     line two
...  and line three
... """
>>> print x
  line one
    line two
 and line three
>>> print reindent(x, 4)
    line one
    line two
    and line three
Even if the lines in s are initially indented differently, this recipe makes their indentation homogeneous, which is sometimes what we want, and sometimes not. A frequent need is to adjust the amount of leading spaces in each line, so that the relative indentation of each line in the block is preserved. This is not difficult for either positive or negative values of the adjustment. However, negative values need a check to ensure that no nonspace characters are snipped from the start of the lines. Thus, we may as well split the functionality into two functions to perform the transformations, plus one to measure the number of leading spaces of each line and return the result as a list:
def addSpaces(s, numAdd):
    white = " "*numAdd
    return white + white.join(s.splitlines(True))
def numSpaces(s):
    return [len(line)-len(line.lstrip( )) for line in s.splitlines( )]
def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError, "removing more spaces than there are!"
    return '\n'.join([ line[numDel:] for line in s.splitlines( ) ])
All of these functions rely on the string method splitlines, which is similar to a split on '\n'. splitlines has the extra ability to leave the trailing newline on each line (when you call it with True as its argument). Sometimes this turns out to be handy: addSpaces could not be quite as short and sweet without this ability of the splitlines string method.
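To make the difference concrete, here is a small sketch comparing splitlines, splitlines(True), and a split on '\n'. The sample text and the variable names are illustrative only, not part of the recipe:

```python
text = "one\ntwo\n"

# Default: the line terminators are stripped.
plain = text.splitlines()

# With True, each piece keeps its trailing newline.
kept = text.splitlines(True)

# split('\n') also strips the newlines, but it leaves an extra empty
# string after the final newline.
by_split = text.split('\n')

# Joining the newline-terminated pieces with an indent string rebuilds
# the text with every line indented: the trick addSpaces relies on.
indent = "  "
indented = indent + indent.join(kept)
```

Because the newlines survive in kept, no separate step is needed to put them back after the indent has been added.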
Here’s how we can combine these functions to build another function to delete just enough leading spaces from each line to ensure that the least-indented line of the block becomes flush left, while preserving the relative indentation of the lines:
def unIndentBlock(s):
    return delSpaces(s, min(numSpaces(s)))
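As a quick check of how these pieces fit together, the following self-contained sketch repeats the three helper functions and runs them on a small block. The sample block is an assumption chosen for illustration:

```python
def numSpaces(s):
    return [len(line) - len(line.lstrip()) for line in s.splitlines()]

def delSpaces(s, numDel):
    if numDel > min(numSpaces(s)):
        raise ValueError("removing more spaces than there are!")
    return '\n'.join([line[numDel:] for line in s.splitlines()])

def unIndentBlock(s):
    return delSpaces(s, min(numSpaces(s)))

block = "    if x:\n        y = 1\n    z = 2"
flush_left = unIndentBlock(block)   # relative indentation is preserved

# Asking for more spaces than the least-indented line has is an error.
try:
    delSpaces(block, 5)
    refused = False
except ValueError:
    refused = True

# A blank line counts as having zero leading spaces, so a block that
# contains one is left with its indentation unchanged.
with_blank = unIndentBlock("  a\n\n  b")
```

The last case suggests that, if blank lines are expected in the input, it may be worth skipping them when computing the minimum; that refinement is not part of the recipe.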
Library Reference and Python in a Nutshell docs on sequence types.
Credit: Alex Martelli, David Ascher
You want to convert tabs in a string to the appropriate number of spaces, or vice versa.
Changing tabs to the appropriate number of spaces is a reasonably frequent task, easily accomplished with Python strings’ expandtabs method. Because strings are immutable, the method returns a new string object, a modified copy of the original one. However, it’s easy to rebind a string variable name from the original to the modified-copy value:
mystring = mystring.expandtabs()
This doesn’t change the string object to which mystring originally referred, but it does rebind the name mystring to a newly created string object, a modified copy of mystring in which tabs are expanded into runs of spaces. expandtabs, by default, uses a tab length of 8; you can pass expandtabs an integer argument to use as the tab length.
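A minimal illustration of both forms of the call follows; the sample line is an assumption made up for the example:

```python
line = "a\tbc\td"

# Default tab length of 8: each tab pads to the next multiple of 8 columns.
default_stops = line.expandtabs()

# Explicit tab length of 4.
narrow_stops = line.expandtabs(4)
```

Note that line itself is untouched; each call returns a new string.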
Changing spaces into tabs is a rare and peculiar need. Compression, if that’s what you’re after, is far better performed in other ways, so Python doesn’t offer a built-in way to “unexpand” spaces into tabs. We can, of course, write our own function for the purpose. String processing tends to be fastest in a split/process/rejoin approach, rather than with repeated overall string transformations:
def unexpand(astring, tablen=8):
    import re
    # split into alternating space and non-space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    # keep track of the total length of the string so far
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        if piece.isspace():
            # change each space sequence into tabs+spaces, never
            # emitting more blanks than the sequence itself contains
            numblanks = min(lensofar % tablen, thislen)
            numtabs = (thislen - numblanks + tablen - 1) // tablen
            pieces[i] = '\t' * numtabs + ' ' * numblanks
    return ''.join(pieces)
Function unexpand, as written in this example, works only for a single-line string; to deal with a multi-line string, use ''.join([ unexpand(s) for s in astring.splitlines(True) ]).
While regular expressions are never indispensable for the purpose of manipulating strings in Python, they are occasionally quite handy. Function unexpand, as presented in the recipe, for example, takes advantage of one extra feature of re.split with respect to string’s split method: when the regular expression contains a (parenthesized) group, re.split returns a list where the split pieces are interleaved with the “splitter” pieces. So, here, we get alternate runs of nonblanks and blanks as items of list pieces; the for loop keeps track of the length of string it has seen so far, and changes pieces that are made of blanks to as many tabs as possible, plus as many blanks as are needed to maintain the overall length.
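The effect of the parenthesized group is easy to see in isolation. In this sketch the sample line is an assumption chosen for illustration:

```python
import re

line = "ab   cd e"

# No group: the runs of spaces are discarded.
without_group = re.split(r' +', line)

# With a group: the runs of spaces are kept, interleaved with the rest.
with_group = re.split(r'( +)', line)

# Since nothing was thrown away, rejoining gives back the original.
rejoined = ''.join(with_group)
```

Keeping the separators is what lets a split/process/rejoin approach rewrite only the blank runs and leave everything else intact.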
Some programming tasks that could still be described as expanding tabs are unfortunately not quite as easy as just calling the expandtabs method. A category that does happen with some regularity is to fix Python source files, which use a mix of tabs and spaces for indentation (a very bad idea), so that they instead use spaces only (which is the best approach). This could entail extra complications, for example, when you need to guess the tab length (and want to end up with the standard four spaces per indentation level, which is strongly advisable). It can also happen when you need to preserve tabs that are inside strings, rather than tabs being used for indentation (because somebody erroneously used actual tabs, rather than '\t', to indicate tabs in strings), or even because you’re asked to treat docstrings differently from other strings. Some cases are not too bad: for example, when you want to expand tabs that occur only within runs of whitespace at the start of each line, leaving any other tab alone. A little function using a regular expression suffices:
def expand_at_linestart(P, tablen=8):
    import re
    def exp(mo):
        return mo.group().expandtabs(tablen)
    return ''.join([re.sub(r'^\s+', exp, s) for s in P.splitlines(True)])
This function expand_at_linestart exploits the re.sub function, which looks for a regular expression in a string and, each time it gets a match, calls a function, passing the match object as the argument, to obtain the string to substitute in place of the match. For convenience, expand_at_linestart is coded to deal with a multiline string argument P, performing the list comprehension over the results of the splitlines call, and the ''.join of the whole. Of course, this convenience does not stop the function from being able to deal with a single-line P.
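The callback form of re.sub can be tried on a single line. In the sketch below, expand_leading and the sample line are hypothetical names introduced only for this illustration:

```python
import re

def expand_leading(line, tablen=8):
    # The callback receives a match object and returns the replacement.
    def exp(mo):
        return mo.group().expandtabs(tablen)
    # Only the whitespace run anchored at the start of the line matches.
    return re.sub(r'^\s+', exp, line)

# One tab used for indentation, one actual tab inside a string literal.
line = '\tx = "a\tb"\n'
fixed = expand_leading(line, 4)
```

The leading tab becomes four spaces, while the tab inside the quotes is left alone because the pattern cannot match there.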
If your specifications regarding which tabs are to be expanded are even more complex, such as needing to deal differently with tabs depending on whether they’re inside or outside of strings, and on whether or not strings are docstrings, at the very least, you need to perform a tokenization. In addition, you may also have to perform a full parse of the source code you’re dealing with, rather than using simple string or regular-expression operations. If this is the case, you can expect a substantial amount of work. Some beginning pointers to help you get started may be found in Chapter 16.
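As a starting point for the tokenization step, the standard library’s tokenize module can report which tokens are string literals. The sketch below is written for Python 3 (it relies on io.StringIO accepting ordinary strings); strings_with_tabs and the sample source are hypothetical names introduced for illustration, and the function only locates literals that contain an actual tab, without rewriting anything:

```python
import io
import tokenize

def strings_with_tabs(source):
    """Return the string-literal tokens in source that contain a raw tab."""
    found = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        tok_type, tok_text = tok[0], tok[1]
        if tok_type == tokenize.STRING and '\t' in tok_text:
            found.append(tok_text)
    return found

# The first literal holds an actual tab; the second uses the \t escape.
source = 'x = "a\tb"\ny = "c\\td"\n'
hits = strings_with_tabs(source)
```

Telling docstrings apart from other strings needs more context than a flat token stream gives, which is where a full parse comes in.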
If you ever find yourself sweating out this kind of task, you will no doubt get excellent motivation in the future for following the normal and recommended Python style in the source code you write or edit: only spaces, four per indentation level, no tabs, and always '\t', never an actual tab character, to include a tab in a string literal. Your favorite editor can no doubt be told to enforce all of these conventions whenever a Python source file is saved; the editor that comes with IDLE (the free integrated development environment that comes with Python), for example, supports these conventions. It is much easier to arrange your editor so that the problem never arises, rather than striving to fix it after the fact!
Documentation for the expandtabs method of strings in the “Sequence Types” section of the Library Reference; Perl Cookbook recipe 1.7; Library Reference and Python in a Nutshell documentation of module re.
You need a simple way to get a copy of a string where specially marked substrings are replaced with the results of looking up the substrings in a dictionary.
Here is a solution that works in Python 2.3 as well as in 2.4:
def expand(format, d, marker='"', safe=False):
    if safe:
        def lookup(w): return d.get(w, w.join(marker*2))
    else:
        def lookup(w): return d[w]
    parts = format.split(marker)
    parts[1::2] = map(lookup, parts[1::2])
    return ''.join(parts)

if __name__ == '__main__':
    print expand('just "a" test', {'a': 'one'})
    # emits: just one test
When the parameter safe is False, the default, every marked substring must be found in dictionary d, otherwise expand terminates with a KeyError exception. When parameter safe is explicitly passed as True, marked substrings that are not found in the dictionary are just left intact in the output string.
The code in the body of the expand function has some points of interest. It defines one of two different nested functions (with the name of lookup either way), depending on whether the expansion is required to be safe. Safe means no KeyError exception gets raised for marked strings not found in the dictionary. If not required to be safe (the default), lookup just indexes into dictionary d and raises an error if the substring is not found. But, if lookup is required to be “safe”, it uses d’s method get and supplies as the default the substring being looked up, with a marker on either side. In this way, by passing safe as True, you may choose to have unknown formatting markers come right through to the output rather than raising exceptions. marker+w+marker would be an obvious alternative to the chosen w.join(marker*2), but I’ve chosen the latter exactly to display a non-obvious but interesting way to construct such a quoted string.
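The join-based quoting works because joining the two characters of marker*2 with w puts one marker on each side of w. A short sketch (the sample values are assumptions for illustration) also shows that the two spellings agree only when the marker is a single character:

```python
w = 'name'

# Single-character marker: both spellings give the same quoted string.
marker = '"'
quoted_by_join = w.join(marker * 2)
quoted_by_concat = marker + w + marker

# Two-character marker: marker*2 has four characters, so join inserts
# w three times and the result is no longer the quoted string.
long_marker = '%%'
by_join_long = w.join(long_marker * 2)
by_concat_long = long_marker + w + long_marker
```

With a multi-character marker, plain concatenation is the form that gives the intended result.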
With either version of lookup, expand operates according to the split/modify/join idiom that is so important for Python string processing. The modify part, in expand’s case, makes use of the possibility of accessing and modifying a list’s slice with a “step” or “stride”. Specifically, expand accesses and rebinds all of those items of parts that lie at an odd index, because those items are exactly the ones that were enclosed between a pair of markers in the original format string. Therefore, they are the marked substrings that may be looked up in the dictionary.
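The stride-based slice assignment can be seen on its own in the following sketch, which uppercases the marked substrings instead of looking them up. The sample format string is an assumption for illustration:

```python
parts = 'just "a" test "b"!'.split('"')
# parts is now ['just ', 'a', ' test ', 'b', '!']

# Items at odd indices are the ones that sat between a pair of markers.
marked = parts[1::2]

# Assign back to the same slice; the replacement must supply exactly
# as many items as the slice selects.
parts[1::2] = [w.upper() for w in marked]

result = ''.join(parts)
```

The items at even indices are never touched, so the text outside the markers passes through unchanged.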
The syntax of format strings accepted by this recipe’s function expand is more flexible than the $-based syntax of string.Template. You can specify a different marker when you want your format string to contain double quotes, for example. There is no constraint for each specially marked substring to be an identifier, so you can easily interpolate Python expressions (with a d whose __getitem__ performs an eval) or any other kind of placeholder. Moreover, you can easily get slightly different, useful effects. For example:
print expand('just "a" ""little"" test', {'a' : 'one', '' : '"'})
emits just one "little" test.
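A mapping whose __getitem__ performs an eval might look like the sketch below. EvalDict is a hypothetical class introduced here for illustration, and the expand shown is a simplified variant of the recipe’s function that also calls str on each result, since evaluated expressions need not be strings:

```python
def expand(format, d, marker='"'):
    # Simplified variant: always strict, and converts results to str.
    parts = format.split(marker)
    parts[1::2] = [str(d[w]) for w in parts[1::2]]
    return ''.join(parts)

class EvalDict(object):
    """Hypothetical mapping: looking up a key evaluates it as an expression."""
    def __init__(self, namespace):
        self.namespace = namespace
    def __getitem__(self, expression):
        return eval(expression, {}, self.namespace)

result = expand('2+3 is "2+3", x squared is "x*x"', EvalDict({'x': 4}))
```

As with any use of eval, this is only appropriate when the format strings come from a trusted source.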
Advanced users can customize Python 2.4’s string.Template
class, by inheritance, to
match all of these capabilities, and more, but this recipe’s little
expand
function is still simpler to use in some
flexible ways.
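For comparison, here is a minimal sketch of the inheritance route: string.Template documents a delimiter class attribute that a subclass may override. PercentTemplate is a hypothetical name chosen for this illustration:

```python
import string

class PercentTemplate(string.Template):
    # Use % rather than $ to introduce placeholders.
    delimiter = '%'

t = PercentTemplate('this costs $5, payable by %who')
result = t.substitute(who='Fred')
```

With the delimiter changed, a dollar sign in the template is ordinary text and needs no doubling.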
Library Reference docs for string.Template (Python 2.4, only), the section on sequence types (for string methods split and join, and for slicing operations), and the section on dictionaries (for indexing and the get method). For more information on Python 2.4’s string.Template class, see Recipe 1.17.
Credit: John Nielsen, Lawrence Oluyede, Nick Coghlan
Using Python 2.4, you need a simple way to get a copy of a string where specially marked identifiers are replaced with the results of looking up the identifiers in a dictionary.
Python 2.4 offers the new string.Template
class for this purpose. Here
is a snippet of code showing how to use that class:
import string
# make a template from a string where some identifiers are marked with $
new_style = string.Template('this is $thing')
# use the substitute method of the template with a dictionary argument:
print new_style.substitute({'thing':5})        # emits: this is 5
print new_style.substitute({'thing':'test'})   # emits: this is test
# alternatively, you can pass keyword-arguments to 'substitute':
print new_style.substitute(thing=5)            # emits: this is 5
print new_style.substitute(thing='test')       # emits: this is test
In Python 2.3, a format string for identifier-substitution has to be expressed in a less simple format:
old_style = 'this is %(thing)s'
with the identifier in parentheses after a %, and an s right after the closed parenthesis. Then, you use the % operator, with the format string on the left of the operator, and a dictionary on the right:
print old_style % {'thing':5}        # emits: this is 5
print old_style % {'thing':'test'}   # emits: this is test
Of course, this code keeps working in Python 2.4, too. However,
the new string.Template
class
offers a simpler alternative.
When you build a string.Template instance, you may include a dollar sign ($) by doubling it, and you may have the interpolated identifier immediately followed by letters or digits by enclosing it in curly braces ({ }). Here is an example that requires both of these refinements:
form_letter = '''Dear $customer,
I hope you are having a great time.
If you do not find Room $room to your satisfaction,
let us know. Please accept this $$5 coupon.
            Sincerely,
            $manager
            ${name}Inn'''
letter_template = string.Template(form_letter)
print letter_template.substitute({'name':'Sleepy', 'customer':'Fred Smith',
                                  'manager':'Barney Mills', 'room':307,
                                 })
This snippet emits the following output:
Dear Fred Smith,
I hope you are having a great time.
If you do not find Room 307 to your satisfaction,
let us know. Please accept this $5 coupon.
            Sincerely,
            Barney Mills
            SleepyInn
Sometimes, the handiest way to prepare a dictionary to be used as the argument to the substitute method is to set local variables, and then pass as the argument locals() (the artificial dictionary whose keys are the local variables, each with its value associated):
msg = string.Template('the square of $number is $square')
for number in range(10):
    square = number * number
    print msg.substitute(locals())
Another handy alternative is to pass the values to substitute using keyword argument syntax rather than a dictionary:
msg = string.Template('the square of $number is $square')
for i in range(10):
    print msg.substitute(number=i, square=i*i)
You can even pass both a dictionary and keyword arguments:
msg = string.Template('the square of $number is $square')
for number in range(10):
    print msg.substitute(locals(), square=number*number)
In case of any conflict between entries in the dictionary and the values explicitly passed as keyword arguments, the keyword arguments take precedence. For example:
msg = string.Template('an $adj $msg')
adj = 'interesting'
print msg.substitute(locals(), msg='message')
# emits an interesting message
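One further point worth knowing when preparing the substitution values: substitute raises KeyError when a placeholder has no value, while the companion method safe_substitute leaves such placeholders in place. The sketch below uses Python 3 syntax for the except clause, and the template text is an assumption for illustration:

```python
import string

t = string.Template('Dear $customer, your room is $room')

# substitute insists on a value for every placeholder.
try:
    t.substitute(customer='Fred')
    missing = None
except KeyError as err:
    missing = err.args[0]

# safe_substitute fills in what it can and keeps the rest verbatim.
partial = t.safe_substitute(customer='Fred')
```

Which behavior is preferable depends on whether a missing value should be treated as an error or filled in by a later pass.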
Library Reference docs for string.Template
(2.4 only) and the locals
built-in function.
Credit: Xavier Defrang, Alex Martelli
Sometimes regular expressions afford the fastest
solution even in cases where their applicability is not obvious. The
powerful sub
method of re
objects (from the re
module in the standard library) makes
regular expressions particularly good at performing string
substitutions. Here is a function returning a modified copy of an
input string, where each occurrence of any string that’s a key in a
given dictionary is replaced by the corresponding value in the
dictionary:
import re
def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)
This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let’s say you have a dictionary-based mapping between strings. The keys are the set of strings you want to replace, and the corresponding values are the strings with which to replace them. You could perform the substitution by calling the string method replace for each key/value pair in the dictionary, thus processing and creating a new copy of the entire text several times, but it is clearly better and faster to do all the changes in a single pass, processing and creating a copy of the text only once. The callback facility of re.sub makes this better approach quite easy.
First, we have to build a regular expression from the set of keys we want to match. Such a regular expression has a pattern of the form a1|a2|...|aN, made up of the N strings to be substituted, joined by vertical bars, and it can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we pass it a callback argument. re.sub then calls this object for each match, with a re.MatchObject instance as its only argument, and it expects the replacement string for that match as the call’s result. In our case, the callback just has to look up the matched text in the dictionary and return the corresponding value.
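Speed is not the only reason to prefer a single pass. When one key’s replacement is itself another key, chained calls to replace can undo or repeat earlier substitutions. The following sketch repeats multiple_replace so that it is self-contained; the swap dictionary and sample text are assumptions for illustration:

```python
import re

def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)

swap = {'cat': 'dog', 'dog': 'cat'}
text = 'cat chases dog'

# Single pass: each match is replaced once and never rescanned.
single_pass = multiple_replace(text, swap)

# Chained replace: the second call sees the output of the first.
chained = text
for old, new in swap.items():
    chained = chained.replace(old, new)
```

Whichever order the chained loop happens to run in, both animals end up the same, so the swap is lost.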
The function multiple_replace
presented
in the recipe recomputes the regular expression and redefines the
one_xlat
auxiliary function each time you call it.
Often, you must perform substitutions on multiple strings based on the
same, unchanging translation dictionary and would prefer to pay these
setup prices only once. For such needs, you may prefer the following
closure-based approach:
import re
def make_xlat(*args, **kwds):
    adict = dict(*args, **kwds)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    def xlat(text):
        return rx.sub(one_xlat, text)
    return xlat
You can call make_xlat, passing as its argument a dictionary, or any other combination of arguments you could pass to built-in dict in order to construct a dictionary; make_xlat returns an xlat closure that takes as its only argument text, the string on which the substitutions are desired, and returns a copy of text with all the substitutions performed.
Here’s a usage example for each half of this recipe. We would normally have such an example as a part of the same .py source file as the functions in the recipe, so it is guarded by the traditional Python idiom that runs it if and only if the module is called as a main script:
if __name__ == "__main__":
    text = "Larry Wall is the creator of Perl"
    adict = {
      "Larry Wall" : "Guido van Rossum",
      "creator" : "Benevolent Dictator for Life",
      "Perl" : "Python",
    }
    print multiple_replace(text, adict)
    translate = make_xlat(adict)
    print translate(text)
Substitutions such as those performed by this recipe are often intended to operate on entire words, rather than on arbitrary substrings. Regular expressions are good at picking up the beginnings and endings of words, thanks to the special sequence r'\b'. We can easily make customized versions of either multiple_replace or make_xlat by simply changing the one line in which each of them builds and assigns the regular expression object rx into a slightly different form:
rx = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, adict)))
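The difference between substring matching and whole-word matching shows up as soon as a key occurs inside a longer word. In this sketch the dictionary and sample text are assumptions for illustration:

```python
import re

adict = {'cat': 'dog'}

def lookup(match):
    return adict[match.group(0)]

# Plain alternation: matches the key anywhere, even inside other words.
rx_any = re.compile('|'.join(map(re.escape, adict)))

# Each alternative wrapped in \b word-boundary assertions.
rx_word = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, adict)))

text = 'cat concatenate'
anywhere = rx_any.sub(lookup, text)
whole_words = rx_word.sub(lookup, text)
```

Only the standalone word is replaced in the second case, because \b requires a transition between a word character and a non-word character on each side of the key.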
The rest of the code is just the same as shown earlier in this recipe. However, this sameness is not necessarily good news: it suggests that if we need many similarly customized versions, each building the regular expression in slightly different ways, we’ll end up doing a lot of copy-and-paste coding, which is the worst form of code reuse, likely to lead to high maintenance costs in the future.
A key rule of good coding is: “once, and only once!” When we notice that we are duplicating code, we should recognize this symptom as a “code smell,” and refactor our code for better reuse. In this case, for ease of customization, we need a class rather than a function or closure. For example, here’s how to write a class that works very similarly to make_xlat but can be customized by subclassing and overriding:
class make_xlat:
    def __init__(self, *args, **kwds):
        self.adict = dict(*args, **kwds)
        self.rx = self.make_rx()
    def make_rx(self):
        return re.compile('|'.join(map(re.escape, self.adict)))
    def one_xlat(self, match):
        return self.adict[match.group(0)]
    def __call__(self, text):
        return self.rx.sub(self.one_xlat, text)
This is a “drop-in replacement” for the function of the same name: in other words, a snippet such as the one we showed, with the if __name__ == '__main__' guard, works identically when make_xlat is this class rather than the previously shown function. The function is simpler and faster, but the class’ important advantage is that it can easily be customized in the usual object-oriented way: by subclassing it and overriding some method. To translate by whole words, for example, all we need to code is:
class make_xlat_by_whole_words(make_xlat):
    def make_rx(self):
        return re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, self.adict)))
Ease of customization by subclassing and overriding helps you avoid copy-and-paste coding, and this is sometimes an excellent reason to prefer object-oriented structures over simpler functional structures, such as closures. Of course, just because some functionality is packaged as a class doesn’t magically make it customizable in just the way you want. Customizability also requires some foresight in dividing the functionality into separately overridable methods that correspond to the right pieces of overall functionality. Fortunately, you don’t have to get it right the first time; when code does not have the optimal internal structure for the task at hand (in this specific example, for reuse by subclassing and selective overriding), you can and should refactor the code so that its internal structure serves your needs. Just make sure you have a suitable battery of tests ready to run to ensure that your refactoring hasn’t broken anything, and then you can refactor to your heart’s content. See http://www.refactoring.com for more information on the important art and practice of refactoring.
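As one more illustration of customization by overriding, the sketch below adds a case-insensitive variant. The subclass make_xlat_ignoring_case is hypothetical, not part of the recipe, and it assumes that every key in the dictionary is lowercase; the base class is repeated so that the example is self-contained:

```python
import re

class make_xlat(object):
    def __init__(self, *args, **kwds):
        self.adict = dict(*args, **kwds)
        self.rx = self.make_rx()
    def make_rx(self):
        return re.compile('|'.join(map(re.escape, self.adict)))
    def one_xlat(self, match):
        return self.adict[match.group(0)]
    def __call__(self, text):
        return self.rx.sub(self.one_xlat, text)

class make_xlat_ignoring_case(make_xlat):
    # Hypothetical subclass: assumes all dictionary keys are lowercase.
    def make_rx(self):
        pattern = '|'.join(map(re.escape, self.adict))
        return re.compile(pattern, re.IGNORECASE)
    def one_xlat(self, match):
        # Normalize the matched text before looking it up.
        return self.adict[match.group(0).lower()]

translate = make_xlat_ignoring_case({'perl': 'Python'})
result = translate('PERL, Perl and perl')
```

Here two methods had to be overridden together, which shows why the way functionality is divided among methods matters for this style of reuse.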
Documentation for the re
module in the Library Reference and
Python in a Nutshell; the Refactoring home page
(http://www.refactoring.com).
For a certain string s, you must check whether s has any of several endings; in other words, you need a handy, elegant equivalent of s.endswith(end1) or s.endswith(end2) or s.endswith(end3) and so on.
The itertools.imap
function
is just as handy for this task as for many of a similar nature:
import itertools
def anyTrue(predicate, sequence):
    return True in itertools.imap(predicate, sequence)
def endsWith(s, *endings):
    return anyTrue(s.endswith, endings)
A typical use for endsWith
might be to print
all names of image files in the current directory:
import os
for filename in os.listdir('.'):
    if endsWith(filename, '.jpg', '.jpeg', '.gif'):
        print filename
The same general idea shown in this recipe’s Solution is easily applied to other tasks related to checking a string for any of several possibilities. The auxiliary function anyTrue is general and fast, and you can pass it as its first argument (the predicate) other bound methods, such as s.startswith or s.__contains__. Indeed, perhaps it would be better to do without the helper function endsWith; after all, directly coding
if anyTrue(filename.endswith, (".jpg", ".gif", ".png")):
seems to be already readable enough.
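The recipe targets Python 2.3 and 2.4. For readers on later versions, the sketch below shows the same idea in Python 3, where itertools.imap no longer exists because the built-in map is itself lazy; it also uses the built-in any and the tuple form of endswith, both of which arrived in Python 2.5. The names any_true and ends_with are respelled here only for illustration:

```python
def any_true(predicate, sequence):
    # On Python 3, map is lazy, so the membership test stops at the
    # first True result, just as with itertools.imap on Python 2.
    return True in map(predicate, sequence)

def ends_with(s, *endings):
    return any_true(s.endswith, endings)

name = 'photo.jpeg'
via_helper = ends_with(name, '.jpg', '.jpeg', '.gif')

# Equivalent spellings using later built-in facilities.
via_any = any(map(name.endswith, ('.jpg', '.jpeg', '.gif')))
via_tuple = name.endswith(('.jpg', '.jpeg', '.gif'))
```

On such versions the tuple form of endswith is probably the simplest choice for this particular task, while any_true remains useful for other predicates.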
This recipe originates from a discussion on news:comp.lang.python and summarizes inputs from many people, including Raymond Hettinger, Chris Perkins, Bengt Richter, and others.
Library Reference and Python
in a Nutshell docs for itertools
and string methods.
Credit: Holger Krekel
Python has a first class unicode
type that you can use in place of
the plain bytestring str
type. It’s
easy, once you accept the need to explicitly convert between a
bytestring and a Unicode string:
>>> german_ae = unicode('\xc3\xa4', 'utf8')
Here german_ae is a unicode string representing the German lowercase a with umlaut (i.e., diaeresis) character “ä”. It has been constructed from interpreting the bytestring '\xc3\xa4' according to the specified UTF-8 encoding. There are many encodings, but UTF-8 is often used because it is universal (UTF-8 can encode any Unicode string) and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded string).
Once you cross this barrier, life is easy! You can manipulate
this Unicode string in practically the same way as a plain str
string:
>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])
Note that para
is a Unicode
string, because operations between a
unicode
string and a bytestring
always result in a unicode
string—unless they fail and raise an exception:
>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position 0: ordinal not in range(128)
The byte '0xc3' is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using Unicode strings with Python.
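The same two bytes can be decoded explicitly under different encodings to see why guessing would be dangerous. This sketch is written so that it runs on Python 3, where the bytestring type is spelled bytes (with a b prefix on literals) and the Unicode type is str; the variable names are illustrative:

```python
raw = b'\xc3\xa4'

# Interpreted as UTF-8, the two bytes are one character: a-umlaut.
as_utf8 = raw.decode('utf-8')

# Interpreted as Latin-1, the same bytes are two different characters.
as_latin1 = raw.decode('latin-1')

# Interpreted as ASCII, they are not valid at all.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Two of the three decodings succeed and give different text, so only knowledge from outside the bytes themselves can say which one is right.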
Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don’t have to care much: you can just use the efficient implementation of Unicode that Python provides.
The most important issue is to fully accept the distinction
between a bytestring and a unicode
string. As exemplified in this recipe’s solution, you often need to
explicitly construct a unicode
string by providing a bytestring and an encoding. Without an encoding,
a bytestring is basically meaningless, unless you happen to be lucky
and can just assume that the bytestring is text in ASCII.
The most common problem with using Unicode in Python arises when
you are doing some text manipulation where only some of your strings
are unicode
objects and others are
bytestrings. Python makes a shallow attempt to implicitly convert your
bytestrings to Unicode. It usually assumes an ASCII encoding, though,
which gives you UnicodeDecodeError
exceptions if you actually have non-ASCII bytes somewhere. UnicodeDecodeError
tells you that you mixed
Unicode and bytestrings in such a way that Python cannot (doesn’t even
try to) guess the text your bytestring might represent.
Developers from many big Python projects have come up with simple rules of thumb to prevent such runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO barriers. To express this same concept a bit more extensively:
Whenever your program receives text data “from the outside” (from the network, from a file, from user input, etc.), construct unicode objects immediately. Find out the appropriate encoding, for example, from an HTTP header, or look for an appropriate convention to determine the encoding to use.
Whenever your program sends text data “to the outside” (to the network, to some file, to the user, etc.), determine the correct encoding, and convert your text to a bytestring with that encoding. (Otherwise, Python attempts to convert Unicode to an ASCII bytestring, likely producing UnicodeEncodeErrors, which are just the converse of the UnicodeDecodeErrors previously mentioned).
With these two rules, you will solve most Unicode problems. If you still get UnicodeErrors of either kind, look for where you forgot to properly construct a unicode object, forgot to properly convert back to an encoded bytestring, or ended up using an inappropriate encoding due to some mistake. (It is quite possible that such encoding mistakes are due to the user, or some other program that is interacting with yours, not following the proper encoding rules or conventions.)
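The two rules can be sketched as a pair of small boundary functions. The names from_outside and to_outside are hypothetical, the sample bytes are an assumption for illustration, and the code is written to run on Python 3 (where bytestrings are bytes and Unicode strings are str):

```python
def from_outside(raw_bytes, encoding):
    # Rule 1: decode as soon as the data enters the program.
    return raw_bytes.decode(encoding)

def to_outside(text, encoding):
    # Rule 2: encode only at the moment the data leaves the program.
    return text.encode(encoding)

# "Gruesse" with u-umlaut and sharp s, as it might arrive in UTF-8.
incoming = b'Gr\xc3\xbc\xc3\x9fe'
text = from_outside(incoming, 'utf-8')

# Inside the program, all manipulation happens on Unicode text.
greeting = text + u' aus Python'

# The consumer in this example is assumed to expect Latin-1.
outgoing = to_outside(greeting, 'latin-1')
```

Between the two calls the program never needs to know which encodings are in use at its edges.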
In order to convert a Unicode string back to an encoded bytestring, you usually do something like:
>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'
Now bytestring is a German ae character in the 'latin1' encoding. Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in UTF-8) represent the same German character, but in different encodings.
By now, you can probably imagine why Python refuses to guess
among the hundreds of possible encodings. It’s a crucial design
choice, based on one of the Zen of Python
principles: “In the face of ambiguity, resist the temptation to
guess.” At any interactive Python shell prompt, enter the statement
import this
to read all of the
important principles that make up the Zen of
Python.
Unicode is a huge topic, but a recommended book is
Unicode: A Primer, by Tony Graham (Hungry
Minds, Inc.)—details are available at http://www.menteith.com/unicode/primer/; and a
short but complete article from Joel Spolsky, “The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About
Unicode and Character Sets (No Excuses)!,” located at http://www.joelonsoftware.com/articles/Unicode.html.
See also the Library Reference and Python in a Nutshell documentation about the built-in str and unicode types and modules unicodedata and codecs; also, Recipe 1.21 and Recipe 1.22.
Credit: David Ascher, Paul Prescod
Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:
unicodestring = u"Hello world"
# Convert Unicode to plain Python string: "encode"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")
# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode—what it is, how it works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial practical tips, and this recipe tries to offer more perspective.
You don’t need to know everything about Unicode to be able to solve real-world problems with it, but a few basic tidbits of knowledge are indispensable. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as if they were the same thing. A byte can hold up to 256 different values, so these environments are limited to dealing with no more than 256 distinct characters. Unicode, on the other hand, has tens of thousands of characters, which means that each Unicode character takes more than one byte; thus you need to make the distinction between characters and bytes.
Standard Python strings are really bytestrings, and a Python character, being such a string of length 1, is really a byte. Other terms for an instance of the standard Python string type are 8-bit string and plain string. In this recipe we call such instances bytestrings, to remind you of their byte orientation.
A Python Unicode character is an abstract object big enough to
hold any character, analogous to Python’s long integers. You don’t
have to worry about the internal representation; the representation of
Unicode characters becomes an issue only when you are trying to send
them to some byte-oriented function, such as the write
method of files or the send
method of network sockets. At that
point, you must choose how to represent the characters as bytes.
Converting from Unicode to a bytestring is called
encoding the string. Similarly, when you load
Unicode strings from a file, socket, or other byte-oriented object,
you need to decode the strings from bytes to
characters.
Converting Unicode objects to bytestrings can be achieved in
many ways, each of which is called an encoding.
For a variety of historical, political, and technical reasons, there
is no one “right” encoding. Every encoding has a case-insensitive
name, and that name is passed to the encode
and decode
methods as a parameter. Here are a
few encodings you should know about:
The UTF-8 encoding can handle any Unicode character. It is also backwards compatible with ASCII, so that a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backwards-compatible, especially with older Unix tools. UTF-8 is by far the dominant encoding on Unix, as well as the default encoding for XML documents. UTF-8’s primary weakness is that it is fairly inefficient for eastern-language texts.
The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for western languages but more efficient for eastern ones. A variant of UTF-16 is sometimes known as UCS-2.
The ISO-8859 series of encodings are supersets of ASCII, each able to deal with 256 distinct characters. These encodings cannot support all of the Unicode characters; they support only some particular language or family of languages. ISO-8859-1, also known as “Latin-1”, covers most western European and African languages, but not Arabic. ISO-8859-2, also known as “Latin-2”, covers many eastern European languages such as Hungarian and Polish. ISO-8859-15, very popular in Europe these days, is basically the same as ISO-8859-1 with the addition of the Euro currency symbol as a character.
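A few encode calls make these trade-offs visible. The sketch below uses u'' literals and runs on Python 3; the two sample characters (a-umlaut and the Euro sign) are assumptions chosen for illustration:

```python
a_umlaut = u'\xe4'
euro = u'\u20ac'

# UTF-8 is variable-width: ASCII stays one byte, other characters grow.
ascii_same = u'abc'.encode('utf-8') == u'abc'.encode('ascii')
umlaut_utf8_len = len(a_umlaut.encode('utf-8'))
euro_utf8_len = len(euro.encode('utf-8'))

# Python's generic utf-16 codec writes a two-byte byte-order mark
# before the two bytes that hold the character itself.
umlaut_utf16_len = len(a_umlaut.encode('utf-16'))

# The ISO-8859 encodings use one byte per character, but each covers
# only some characters: the Euro sign is in ISO-8859-15, not ISO-8859-1.
umlaut_latin1 = a_umlaut.encode('iso-8859-1')
euro_latin9 = euro.encode('iso-8859-15')
try:
    euro.encode('iso-8859-1')
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

The failure on the last call is the encoding-side counterpart of a decoding error: the target encoding simply has no byte for that character.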
If you want to be able to encode all Unicode characters, you’ll probably want to use UTF-8. You will need to deal with the other encodings only when you are handed data in those encodings created by some other application or input device, or vice versa, when you need to prepare data in a specified encoding to accommodate another application downstream of yours, or an output device. In particular, Recipe 1.22 shows how to handle the case in which the downstream application or device is driven from your program’s standard output stream.
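The trade-offs among these encodings can be checked directly. The following is a small sketch in Python 3; the helper name encoded_sizes and the sample strings are assumptions made for illustration, and utf-16-le is used so that no byte-order mark inflates the counts:

```python
# Compare how many bytes a few encodings need for the same text.
samples = {
    "western": u"Hello, world",
    "japanese": u"\u3053\u3093\u306b\u3061\u306f",   # five hiragana characters
}

def encoded_sizes(text):
    """Return {encoding name: byte count}, or None where encoding fails."""
    sizes = {}
    for name in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None        # this encoding cannot represent the text
    return sizes

for label, text in sorted(samples.items()):
    print(label, encoded_sizes(text))

# Encoding names are case-insensitive, and common aliases are accepted.
print(u"\xe9".encode("UTF-8") == u"\xe9".encode("utf8"))
```

The None entry for ISO-8859-1 on the Japanese sample is the practical meaning of "cannot support all of the Unicode characters".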
Unicode is a huge topic, but a recommended book is Tony Graham,
Unicode: A Primer (Hungry Minds)—details are
available at http://www.menteith.com/unicode/primer/; and a
short, but complete article from Joel Spolsky, “The Absolute Minimum
Every Software Developer Absolutely, Positively Must Know About
Unicode and Character Sets (No Excuses)!” is located at http://www.joelonsoftware.com/articles/Unicode.html.
See also the Library Reference and
Python in a Nutshell documentation about the
built-in str and unicode types, and modules unicodedata and codecs;
also, Recipe 1.20 and Recipe 1.22.
You want to print Unicode strings to standard output (e.g., for debugging), but they don’t fit in the default encoding.
Wrap the sys.stdout
stream with a converter, using the codecs
module of Python’s standard library.
For example, if you know your output is going to a terminal that
displays characters according to the ISO-8859-1 encoding, you can
code:
import codecs, sys
sys.stdout = codecs.lookup('iso8859-1')[-1](sys.stdout)
Unicode strings live in a large space, big enough for all of the
characters in every language worldwide, but thankfully the internal
representation of Unicode strings is irrelevant for users of Unicode.
Alas, a file stream, such as sys.stdout, deals with bytes and has an
encoding associated with it. You can change the default encoding that
is used for new files by modifying the site
module. That, however, requires
changing your entire Python installation, which is likely to confuse
other applications that may expect the encoding you originally
configured Python to use (typically the Python standard encoding,
which is ASCII). Therefore, this kind of modification is
not to be recommended.
This recipe takes a sounder approach: it rebinds sys.stdout
as a stream that expects Unicode
input and outputs it in ISO-8859-1 (also known as “Latin-1”). This
approach doesn’t change the encoding of any previous references to
sys.stdout, as illustrated here.
First, we keep a reference to the original, ASCII-encoded sys.stdout:
>>> old = sys.stdout
Then, we create a Unicode string that wouldn’t normally be able
to go through sys.stdout:
>>> char = u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"
>>> print char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
If you don’t get an error from this operation, it’s because
Python thinks it knows which encoding your “terminal” is using (in
particular, Python is likely to use the right encoding if your
“terminal” is IDLE, the free development environment that comes with
Python). But, suppose you do get this error, or get no error but the
output is not the character you expected, because your “terminal” uses
UTF-8 encoding and Python does not know about it. When that is the
case, we can just wrap sys.stdout
in the codecs
stream writer for
UTF-8, which is a much richer encoding, then rebind sys.stdout
to it and try again:
>>> sys.stdout = codecs.lookup('utf-8')[-1](sys.stdout)
>>> print char
ä
This approach works only if your “terminal”, terminal emulator, or other window in which you’re running the interactive Python interpreter supports the UTF-8 encoding, with a font rich enough to display all the characters you need to output. If you don’t have such a program or device available, you may be able to find a suitable one for your platform in the form of a free program downloadable from the Internet.
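The expression codecs.lookup(name)[-1] picks the stream-writer class out of the codec's lookup result; codecs.getwriter(name) returns the same class by name. The sketch below, written for Python 3, wraps an in-memory byte stream instead of sys.stdout. That stand-in is an assumption made so the bytes produced can be inspected:

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for a byte-oriented stdout

# codecs.getwriter(name) returns the StreamWriter class for that encoding,
# the same object that codecs.lookup(name)[-1] selects in the Solution.
writer = codecs.getwriter("utf-8")(raw)

char = u"\N{LATIN SMALL LETTER A WITH DIAERESIS}"
writer.write(char + u"\n")               # accepts characters, emits bytes

encoded = raw.getvalue()
print(repr(encoded))
```

In Python 3, sys.stdout already accepts text, so the byte stream to wrap in this fashion would be sys.stdout.buffer rather than sys.stdout itself.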
Python tries to determine which encoding your “terminal” is
using and sets that encoding’s name as attribute sys.stdout.encoding.
Sometimes (alas, not always) it even manages to get it right. IDLE
already wraps your sys.stdout, as suggested in this recipe, so, within
the environment’s interactive Python shell, you can directly print
Unicode strings.
Documentation for the codecs and site modules, and setdefaultencoding
in module sys, in the Library Reference and Python in a Nutshell;
Recipe 1.20 and Recipe 1.21.
Credit: David Goodger, Peter Cogolo
You want to encode Unicode text for output in HTML, or some other XML application, using a limited but popular encoding such as ASCII or Latin-1.
Python provides an encoding error handler named xmlcharrefreplace,
which replaces all characters outside of the chosen encoding with XML
numeric character references:
def encode_for_xml(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'xmlcharrefreplace')
You could use this approach for HTML output, too, but you might
prefer to use HTML’s symbolic entity references
instead. For this purpose, you need to define and register a
customized encoding error handler. Implementing that handler is made
easier by the fact that the Python Standard Library includes a module
named htmlentitydefs that holds
HTML entity definitions:
import codecs
from htmlentitydefs import codepoint2name
def html_replace(exc):
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        # build one symbolic entity reference per unencodable character
        s = [ u'&%s;' % codepoint2name[ord(c)]
              for c in exc.object[exc.start:exc.end] ]
        return ''.join(s), exc.end
    else:
        # exception instances have no __name__; ask the exception's type
        raise TypeError("can't handle %s" % type(exc).__name__)
codecs.register_error('html_replace', html_replace)
After registering this error handler, you can optionally write a function to wrap its use:
def encode_for_html(unicode_data, encoding='ascii'):
    return unicode_data.encode(encoding, 'html_replace')
As with any good Python module, this module would normally
proceed with an example of its use, guarded by an if __name__ == '__main__' test:
if __name__ == '__main__':
    # demo
    data = u'''
<html>
<head>
<title>Encoding Test</title>
</head>
<body>
<p>accented characters:
<ul>
<li>\xe0 (a + grave)
<li>\xe7 (c + cedilla)
<li>\xe9 (e + acute)
</ul>
<p>symbols:
<ul>
<li>\xa3 (British pound)
<li>\u20ac (Euro)
<li>\u221e (infinity)
</ul>
</body></html>
'''
    print encode_for_xml(data)
    print encode_for_html(data)
If you run this module as a main script, you will then see such
output as (from function encode_for_xml):
<li>&#224; (a + grave)
<li>&#231; (c + cedilla)
<li>&#233; (e + acute)
...
<li>&#163; (British pound)
<li>&#8364; (Euro)
<li>&#8734; (infinity)
as well as (from function encode_for_html):
<li>&agrave; (a + grave)
<li>&ccedil; (c + cedilla)
<li>&eacute; (e + acute)
...
<li>&pound; (British pound)
<li>&euro; (Euro)
<li>&infin; (infinity)
There is clearly a niche for each case, since encode_for_xml
is more general (you can use it for any XML application, not just
HTML), but encode_for_html may produce output that’s easier to
read, should you ever need to look at it directly, edit it further, and
so on. If you feed either form to a browser, it should render in
exactly the same way. To visualize both forms of encoding in a
browser, run this recipe’s module as a main script, redirect the
output to a disk file, and use a text editor to separate the two
halves before you view them with a browser. (Alternatively, run the
script twice, once commenting out the call to encode_for_xml, and once
commenting out the call to encode_for_html.)
Remember that Unicode data must always be encoded before being
printed or written out to a file. UTF-8 is an ideal encoding, since it
can handle any Unicode character. But for many users and applications,
ASCII or Latin-1 encodings are often preferred over UTF-8. When the
Unicode data contains characters that are outside of the given
encoding (e.g., accented characters and most symbols are not encodable
in ASCII, and the “infinity” symbol is not encodable in Latin-1),
these encodings cannot handle the data on their own. Python supports a
built-in encoding error handler called xmlcharrefreplace, which replaces
unencodable characters with XML numeric character references, such as
&#8734; for the “infinity” symbol. This recipe shows how to write and
register another similar error handler, html_replace, specifically for
producing HTML output. html_replace replaces unencodable characters
with more readable HTML symbolic entity references, such as &infin; for
the “infinity” symbol. html_replace is less general than
xmlcharrefreplace, since it does not support all Unicode characters and
cannot be used with non-HTML applications; however, it can still be
useful if you want HTML output that is as readable as possible in a
“view page source” context.
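One way to soften the "does not support all Unicode characters" limitation is to fall back to a numeric reference whenever no symbolic name is known. The following is a sketch under stated assumptions, not the recipe's own handler: it is written for Python 3, where module htmlentitydefs is named html.entities, and the handler name html_replace_with_fallback is hypothetical:

```python
import codecs
from html.entities import codepoint2name   # Python 3 home of htmlentitydefs

def html_replace_with_fallback(exc):
    """Use a named entity when one exists, else a numeric reference."""
    if not isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        raise TypeError("can't handle %s" % type(exc).__name__)
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if code in codepoint2name:
            pieces.append(u"&%s;" % codepoint2name[code])
        else:
            pieces.append(u"&#%d;" % code)   # same form xmlcharrefreplace emits
    return u"".join(pieces), exc.end

# 'html_replace_with_fallback' is a hypothetical name chosen for this sketch.
codecs.register_error("html_replace_with_fallback", html_replace_with_fallback)

# infinity, euro, and Cyrillic ZHE (which has no entry in codepoint2name)
sample = u"\u221e \u20ac \u0416"
encoded = sample.encode("ascii", "html_replace_with_fallback")
print(encoded)
```

With this variant, characters that have no entry in codepoint2name no longer raise KeyError in the middle of encoding; they simply come out in the less readable numeric form.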
Neither of these error handlers makes sense for output that is
neither HTML nor some other form of XML. For example, TeX and other
markup languages do not recognize XML numeric character references.
However, if you know how to build an arbitrary character reference for
such a markup language, you may modify the example error handler
html_replace
shown in this recipe’s Solution to code
and register your own encoding error handler.
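As an illustration of adapting the handler to another markup language, here is a sketch in Python 3 that emits TeX-style \char"XXXX references. Both the handler name tex_char_replace and the choice of the \char notation are assumptions: that notation suits Unicode-aware engines such as XeTeX or LuaTeX, and a different target may need a different reference syntax:

```python
import codecs

def tex_char_replace(exc):
    r"""Replace each unencodable character with \char"XXXX (hex code point)."""
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("can't handle %s" % type(exc).__name__)
    bad = exc.object[exc.start:exc.end]
    # The trailing space terminates the hexadecimal number for TeX.
    replacement = u"".join(u'\\char"%04X ' % ord(ch) for ch in bad)
    return replacement, exc.end

# 'tex_char_replace' is a hypothetical handler name chosen for this sketch.
codecs.register_error("tex_char_replace", tex_char_replace)

encoded = u"x \u2264 y".encode("ascii", "tex_char_replace")
print(encoded)
```

The overall shape is the same as html_replace: check the exception type, build a replacement for the slice exc.object[exc.start:exc.end], and return it together with the position at which encoding should resume.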
An alternative (and very effective!) way to perform encoding of
Unicode data into a file, with a given encoding and error handler of
your choice, is offered by the codecs
module in Python’s standard
library:
outfile = codecs.open('out.html', mode='w', encoding='ascii', errors='html_replace')
You can now use outfile.write(unicode_data) for any
arbitrary Unicode string unicode_data, and
all the encoding and error handling will be taken care of
transparently. When your output is finished, of course, you should
call outfile.close().
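A runnable sketch of the same idea follows. It swaps in io.open, which accepts the same encoding and errors arguments as codecs.open, and uses the built-in xmlcharrefreplace handler so that the block needs no prior registration; the temporary path and file name are arbitrary choices for illustration:

```python
import io
import os
import tempfile

# 'demo.html' is an arbitrary file name inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "demo.html")

# errors= accepts any registered handler name; the built-in
# 'xmlcharrefreplace' is used here so the block needs no extra setup.
outfile = io.open(path, mode="w", encoding="ascii", errors="xmlcharrefreplace")
try:
    outfile.write(u"<p>price: 5 \u20ac</p>")
finally:
    outfile.close()

# Read the raw bytes back to see what actually reached the disk.
with io.open(path, "rb") as infile:
    written = infile.read()
print(written)
```

Passing errors='html_replace' instead should work the same way once that handler has been registered as shown in the Solution.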
Library Reference and Python
in a Nutshell docs for modules codecs and htmlentitydefs.
Credit: Dale Strickland-Clark, Peter Cogolo, Mark McMahon
You want to treat some strings so that all comparisons and lookups are case-insensitive, while all other uses of the strings preserve the original case.
The best solution is to wrap the specific strings in question
into a suitable subclass of str:
class iStr(str):
    """
    Case insensitive string class.
    Behaves just like str, except that all comparisons and lookups
    are case insensitive.
    """
    def __init__(self, *args):
        self._lowered = str.lower(self)
    def __repr__(self):
        return '%s(%s)' % (type(self).__name__, str.__repr__(self))
    def __hash__(self):
        return hash(self._lowered)
    def lower(self):
        return self._lowered
def _make_case_insensitive(name):
    ''' wrap one method of str into an iStr one, case-insensitive '''
    str_meth = getattr(str, name)
    def x(self, other, *args):
        ''' try lowercasing 'other', which is typically a string, but
            be prepared to use it as-is if lowering gives problems,
            since strings CAN be correctly compared with non-strings.
        '''
        try: other = other.lower()
        except (TypeError, AttributeError, ValueError): pass
        return str_meth(self._lowered, other, *args)
    # in Python 2.4, only, add the statement: x.func_name = name
    setattr(iStr, name, x)
# apply the _make_case_insensitive function to specified methods
for name in 'eq lt le gt ge ne cmp contains'.split():
    _make_case_insensitive('__%s__' % name)
for name in 'count endswith find index rfind rindex startswith'.split():
    _make_case_insensitive(name)
# note that we don't modify methods 'replace', 'split', 'strip', ...
# of course, you can add modifications to them, too, if you prefer that.
del _make_case_insensitive    # remove helper function, not needed any more
Some implementation choices in class iStr are worthy of notice.
First, we choose to generate the lowercase version once and for all, in
method __init__, since we envision that in typical uses of iStr
instances, this version will be required repeatedly. We hold that
version in an attribute that is private, but not overly so (i.e., has
a name that begins with one underscore, not two), because if iStr gets
subclassed (e.g., to make a more extensive version that also offers
case-insensitive splitting, replacing, etc., as the comment in the
“Solution” suggests), iStr’s subclasses are quite likely to want to
access this crucial “implementation detail” of superclass iStr!
We do not offer “case-insensitive” versions of such methods as
replace, because it’s anything but clear what kind of input-output
relation we might want to establish in the general case.
Application-specific subclasses may therefore be the way to provide
this functionality in ways appropriate to a given application. For
example, since the replace method is not wrapped, calling replace on an
instance of iStr returns an instance of str, not of iStr. If that is a
problem in your application, you may want to wrap all iStr methods that
return strings, simply to ensure that the results are made into
instances of iStr. For that purpose, you need another, separate helper
function, similar but not identical to the _make_case_insensitive one
shown in the “Solution”:
def _make_return_iStr(name):
    str_meth = getattr(str, name)
    def x(*args):
        return iStr(str_meth(*args))
    setattr(iStr, name, x)
and you need to call this helper function _make_return_iStr on all
the names of relevant string methods returning strings such as:
for name in 'center ljust rjust strip lstrip rstrip'.split():
    _make_return_iStr(name)
Strings have about 20 methods (including special methods such as
__add__ and __mul__) that you should consider wrapping in this way.
You can also wrap in this way some additional methods, such as split
and join, which may require special handling, and others, such as
encode and decode, that you cannot deal with unless you also define a
case-insensitive unicode subtype. In practice, one can hope that not
every single one of these methods will prove problematic in a typical
application. However, as you can see, the very functional richness of
Python strings makes it a bit of work to customize string subtypes
fully, in a general way without depending on the needs of a specific
application.
The implementation of iStr is careful to avoid the boilerplate code
(meaning repetitious and therefore bug-prone code) that we’d need if we
just overrode each needed method of str in the normal way, with def
statements in the class body. A custom metaclass or other such advanced
technique would offer no special advantage in this case, so the
boilerplate avoidance is simply obtained with one helper function that
generates and installs wrapper closures, and two loops using that
function, one for normal methods and one for special ones. The loops
need to be placed after the class statement, as we do in this recipe’s
Solution, because they need to modify the class object iStr, and the
class object doesn’t exist yet (and thus cannot be modified) until the
class statement has completed.
In Python 2.4, you can reassign the func_name attribute of a function
object, and in this case, you should do so to get clearer and more
readable results when introspection (e.g., the help function in an
interactive interpreter session) is applied to an iStr instance.
However, Python 2.3 considers attribute func_name of function objects
to be read-only; therefore, in this recipe’s Solution, we have
indicated this possibility only in a comment, to avoid losing Python
2.3 compatibility over such a minor issue.
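The effect of renaming a generated closure is easy to see in isolation. This sketch is written for Python 3, where the attribute is spelled __name__ rather than func_name; the helper make_wrapper is a hypothetical stand-in for the Solution's helper function:

```python
def make_wrapper(name):
    """Build a wrapper around str.<name> and label it with that name."""
    str_meth = getattr(str, name)
    def x(self, *args):
        return str_meth(self, *args)
    # Python 3 spells the Solution's func_name attribute as __name__.
    x.__name__ = name
    x.__doc__ = str_meth.__doc__
    return x

unlabeled = lambda self, *args: None    # keeps its default name
labeled = make_wrapper("startswith")

print(unlabeled.__name__, labeled.__name__)
```

Without the reassignment, every generated wrapper would report the name x, which makes help output and tracebacks harder to read.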
Case-insensitive (but case-preserving) strings have many uses,
from more tolerant parsing of user input, to filename matching on
filesystems that share this characteristic, such as all Windows
filesystems and the Macintosh default filesystem. You might easily
find yourself creating a variety of “case-insensitive” container
types, such as dictionaries, lists, sets, and so on, meaning containers
that go out of their way to treat string-valued keys or items as if
they were case-insensitive. Clearly a better architecture is to factor
out the functionality of “case-insensitive” comparisons and lookups
once and for all; with this recipe in your toolbox, you can just add
the required wrapping of strings into iStr instances wherever you may
need it, including those times when you’re making case-insensitive
container types.
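To show the wrapping at work in ordinary containers, here is a condensed sketch of iStr adapted to Python 3. The adaptation is an assumption of this example, not the Solution itself: Python 3 has no __cmp__, so only the rich comparison methods are wrapped, and __repr__ is omitted for brevity. The sample keys are arbitrary:

```python
class iStr(str):
    """Case-insensitive, case-preserving string (condensed Python 3 sketch)."""
    def __init__(self, *args):
        self._lowered = str.lower(self)
    def __hash__(self):
        return hash(self._lowered)
    def lower(self):
        return self._lowered

def _make_case_insensitive(name):
    str_meth = getattr(str, name)
    def x(self, other, *args):
        try:
            other = other.lower()
        except (TypeError, AttributeError, ValueError):
            pass
        return str_meth(self._lowered, other, *args)
    x.__name__ = name
    setattr(iStr, name, x)

# Python 3 has no __cmp__, so only the rich comparisons are wrapped here.
for name in "eq lt le gt ge ne contains".split():
    _make_case_insensitive("__%s__" % name)
for name in "count endswith find index rfind rindex startswith".split():
    _make_case_insensitive(name)

# Wrapping keys in iStr makes ordinary containers case-insensitive.
extensions = {iStr("README.TXT"): "text", iStr("Setup.py"): "python"}
seen = set([iStr("Alice"), iStr("ALICE"), iStr("bob")])

print(extensions[iStr("readme.txt")])      # lookup ignores case
print(len(seen))                           # 'Alice' and 'ALICE' collapse
print(str(iStr("Setup.py")))               # original case is preserved
```

The dictionary and the set need no special code of their own: hashing and equality are delegated to the items, which is exactly the factoring argued for above.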
For example, a list whose items are basically strings, but are
to be treated case-insensitively (for sorting purposes and in such
methods as count and index), is reasonably easy to build on top
of iStr:
class iList(list):
    def __init__(self, *args):
        list.__init__(self, *args)
        # rely on __setitem__ to wrap each item into iStr...
        self[:] = self
    wrap_each_item = iStr
    def __setitem__(self, i, v):
        if isinstance(i, slice): v = map(self.wrap_each_item, v)
        else: v = self.wrap_each_item(v)
        list.__setitem__(self, i, v)
    def append(self, item):
        list.append(self, self.wrap_each_item(item))
    def extend(self, seq):
        list.extend(self, map(self.wrap_each_item, seq))
Essentially, all we’re doing is ensuring that every item that
gets into an instance of iList gets wrapped by a call to iStr, and
everything else takes care of itself.
Incidentally, this example class iList is carefully coded so that you
can easily make customized subclasses of iList to accommodate
application-specific subclasses of iStr: all such a customized subclass
of iList needs to do is override the single class-level member named
wrap_each_item.
Library Reference and Python
in a Nutshell sections on str, string methods, and special methods
used in comparisons and hashing.
Credit: Brent Burley, Mark Moraes
You need to visualize HTML documents as text, with support for bold and underlined display on your Unix terminal.
The simplest approach is to code a filter
script, taking HTML on standard input and emitting text and terminal
control sequences on standard output. Since this recipe only targets
Unix, we can get the needed terminal control sequences from the “Unix”
command tput, via the function popen of the Python Standard
Library module os:
#!/usr/bin/env python
import sys, os, htmllib, formatter
# use Unix tput to get the escape sequences for bold, underline, reset
set_bold = os.popen('tput bold').read()
set_underline = os.popen('tput smul').read()
perform_reset = os.popen('tput sgr0').read()
class TtyFormatter(formatter.AbstractFormatter):
    ''' a formatter that keeps track of bold and italic font states, and
        emits terminal control sequences accordingly.
    '''
    def __init__(self, writer):
        # first, as usual, initialize the superclass
        formatter.AbstractFormatter.__init__(self, writer)
        # start with neither bold nor italic, and no saved font state
        self.fontState = False, False
        self.fontStack = []
    def push_font(self, font):
        # the `font' tuple has four items, we only track the two flags
        # about whether italic and bold are active or not
        size, is_italic, is_bold, is_tt = font
        self.fontStack.append((is_italic, is_bold))
        self._updateFontState()
    def pop_font(self, *args):
        # go back to previous font state
        try:
            self.fontStack.pop()
        except IndexError:
            pass
        self._updateFontState()
    def _updateFontState(self):
        # emit appropriate terminal control sequences if the state of
        # bold and/or italic(==underline) has just changed
        try:
            newState = self.fontStack[-1]
        except IndexError:
            newState = False, False
        if self.fontState != newState:
            # relevant state change: reset terminal
            print perform_reset,
            # set underline and/or bold if needed
            if newState[0]:
                print set_underline,
            if newState[1]:
                print set_bold,
            # remember the two flags as our current font-state
            self.fontState = newState
# make writer, formatter and parser objects, connecting them as needed
myWriter = formatter.DumbWriter()
if sys.stdout.isatty():
    myFormatter = TtyFormatter(myWriter)
else:
    myFormatter = formatter.AbstractFormatter(myWriter)
myParser = htmllib.HTMLParser(myFormatter)
# feed all of standard input to the parser, then terminate operations
myParser.feed(sys.stdin.read())
myParser.close()
The basic formatter.AbstractFormatter
class, offered
by the Python Standard Library, should work just about anywhere. On
the other hand, the refinements in the TtyFormatter
subclass that’s the focus of this recipe depend on using a Unix-like
terminal, and more specifically on the availability of the tput Unix command to obtain information on
the escape sequences used to get bold or underlined output and to
reset the terminal to its base state.
Many systems that do not have Unix certification, such as Linux
and Mac OS X, do have a perfectly workable tput command and therefore can use this
recipe’s TtyFormatter
subclass just fine. In other
words, you can take the use of the word “Unix” in this recipe just as
loosely as you can take it in just about every normal discussion: take
it as meaning “*ix,” if you will.
If your “terminal” emulator supports other escape sequences for
controlling output appearance, you should be able to adapt this
TtyFormatter
class accordingly. For example, on
Windows, a cmd.exe command window
should, I’m told, support standard ANSI escape sequences, so you could
choose to hard-code those sequences if Windows is the platform on
which you want to run your version of this script.
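A hard-coded variant might look like the following sketch. It makes several assumptions beyond the recipe: it targets Python 3 and therefore uses html.parser in place of htmllib and formatter, it handles only a handful of tags, it does no line wrapping, and the names AnsiTextParser and html_to_ansi are hypothetical. The three escape sequences are the standard ANSI codes for bold, underline, and reset:

```python
from html.parser import HTMLParser   # Python 3 standard library

# Standard ANSI (ECMA-48) SGR sequences, hard-coded instead of asking tput.
SET_BOLD = "\x1b[1m"
SET_UNDERLINE = "\x1b[4m"
PERFORM_RESET = "\x1b[0m"

BOLD_TAGS = ("b", "strong")
UNDERLINE_TAGS = ("i", "em", "u")     # italic shown as underline, as above

class AnsiTextParser(HTMLParser):
    """Collect text, switching bold/underline on and off around tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []
        self.bold = 0
        self.underline = 0

    def _emit_state(self):
        # Reset first, then re-enable whatever is still active.
        self.pieces.append(PERFORM_RESET)
        if self.underline:
            self.pieces.append(SET_UNDERLINE)
        if self.bold:
            self.pieces.append(SET_BOLD)

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self.bold += 1
            self._emit_state()
        elif tag in UNDERLINE_TAGS:
            self.underline += 1
            self._emit_state()

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self.bold:
            self.bold -= 1
            self._emit_state()
        elif tag in UNDERLINE_TAGS and self.underline:
            self.underline -= 1
            self._emit_state()

    def handle_data(self, data):
        self.pieces.append(data)

def html_to_ansi(html_text):
    parser = AnsiTextParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.pieces)

rendered = html_to_ansi("plain <b>bold</b> <i>italic</i>")
print(rendered)
```

As in TtyFormatter, each state change first resets the terminal and then re-enables whichever attributes remain active, so nested tags unwind correctly.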
In many cases, you may prefer to use other existing Unix commands, such as lynx -dump -, to get richer formatting than this recipe provides. However, this recipe comes in quite handy when you find yourself on a system that has a Python installation but lacks such other helpful commands as lynx.
Library Reference and Python
in a Nutshell docs on the formatter
and htmllib
modules; man tput on a Unix or Unix-like system for
more information about the tput
command.