Credit: Alex Martelli
You want to access portions of a string. For example, you’ve read a fixed-width record and want to extract the record’s fields.
Slicing is great, of course, but it only does one field at a time:
afield = theline[3:8]
If you need to think in terms of field length, struct.unpack may be appropriate. Here’s an example of getting a five-byte string, skipping three bytes, getting two eight-byte strings, and then getting the rest:
import struct
# Get a 5-byte string, skip 3, get two 8-byte strings, then all the rest:
baseformat = "5s 3x 8s 8s"
numremain = len(theline)-struct.calcsize(baseformat)
format = "%s %ds" % (baseformat, numremain)
leading, s1, s2, trailing = struct.unpack(format, theline)
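To make the format arithmetic concrete, here is a minimal sketch run on a made-up 30-byte record. It assumes Python 3, where struct.unpack works on bytes rather than on text strings; the sample record and the name fmt are hypothetical, chosen only for illustration.

```python
import struct

# Hypothetical 30-byte record (bytes, as Python 3's struct requires)
theline = b"ABCDE---field001field002TAIL!!"

baseformat = "5s 3x 8s 8s"
# calcsize reports the bytes consumed by the fixed part: 5 + 3 + 8 + 8 = 24
numremain = len(theline) - struct.calcsize(baseformat)
# Append one more string field that covers whatever is left (6 bytes here)
fmt = "%s %ds" % (baseformat, numremain)
leading, s1, s2, trailing = struct.unpack(fmt, theline)

print(fmt)
print(leading, s1, s2, trailing)
```

The three skipped bytes ("---") never appear in the result, and the computed trailing field picks up everything after the fixed part.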
If you need to split at five-byte boundaries, here’s how you could do it:
numfives, therest = divmod(len(theline), 5)
form5 = "%s %dx" % ("5s "*numfives, therest)
fivers = struct.unpack(form5, theline)
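As a quick check of the same idea, the sketch below applies it to a hypothetical 13-byte input, again assuming Python 3 and bytes data. The leftover bytes that do not fill a whole five-byte chunk are skipped by the trailing x field.

```python
import struct

theline = b"abcdefghijklm"   # hypothetical 13-byte input

# 13 bytes give 2 full five-byte chunks and 3 leftover bytes
numfives, therest = divmod(len(theline), 5)
# Build "5s 5s " followed by a pad field that skips the leftover bytes
form5 = "%s %dx" % ("5s " * numfives, therest)
fivers = struct.unpack(form5, theline)

print(fivers)
```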
Chopping a string into individual characters is of course easier:
chars = list(theline)
If you prefer to think of your data as being cut up at specific columns, slicing within list comprehensions may be handier:
cuts = [8,14,20,26,30]
pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts+[None]) ]
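Here is a small worked example of the column-cutting approach on a hypothetical 36-character line. It uses None as the final slice bound, which slicing treats as "up to the end of the string" and which works on Python versions where sys.maxint is not available.

```python
theline = "0123456789abcdefghijklmnopqrstuvwxyz"   # hypothetical 36-character line
cuts = [8, 14, 20, 26, 30]

# Pair each start column with the next cut; the last piece runs to the end
pieces = [theline[i:j] for i, j in zip([0] + cuts, cuts + [None])]

print(pieces)
```

Five cut points produce six pieces, because the text after the last cut is kept as a final piece.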
This recipe was inspired by Recipe 1.1 in the Perl Cookbook. Python’s slicing takes the place of Perl’s substr. Perl’s built-in unpack and Python’s struct.unpack are similar. Perl’s is slightly handier, as it accepts a field length of * for the last field to mean all the rest. In Python, we have to compute and insert the exact length for either extraction or skipping. This isn’t a major issue, because such extraction tasks will usually be encapsulated into small, probably local functions. Memoizing, or automatic caching, may help with performance if the function is called repeatedly, since it allows you to avoid redoing the preparation of the format for the struct unpacking. See also Recipe 17.8.
In a purely Python context, the point of this recipe is to remind you that struct.unpack is often viable, and sometimes preferable, as an alternative to string slicing (not quite as often as unpack versus substr in Perl, given the lack of a *-valued field length, but often enough to be worth keeping in mind).
Each of these snippets is, of course, best encapsulated in a function. Among other advantages, encapsulation ensures we don’t have to work out the computation of the last field’s length on each and every use. This function is the equivalent of the first snippet in the solution:
def fields(baseformat, theline, lastfield=None):
    numremain = len(theline)-struct.calcsize(baseformat)
    format = "%s %d%s" % (baseformat, numremain, lastfield and "s" or "x")
    return struct.unpack(format, theline)
If this function is called in a loop, caching with a key of (baseformat, len(theline), lastfield) may be useful here because it can offer an easy speed-up.
The function equivalent of the second snippet in the solution is:
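def split_by(theline, n, lastfield=None):
    numblocks, therest = divmod(len(theline), n)
    # Every full block is kept; only the leftover bytes are kept or skipped
    baseblock = "%ds" % n
    format = "%s %d%s" % (baseblock*numblocks, therest, lastfield and "s" or "x")
    return struct.unpack(format, theline)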
And for the third snippet:
def split_at(theline, cuts, lastfield=None):
    pieces = [ theline[i:j] for i, j in zip([0]+cuts, cuts) ]
    if lastfield:
        # Keep everything after the last cut as a final piece
        pieces.append(theline[cuts[-1]:])
    return pieces
In each of these functions, a decision worth noticing (and, perhaps, worth criticizing) is that of having a lastfield=None optional parameter. This reflects the observation that while we often want to skip the last, unknown-length subfield, sometimes we want to retain it instead. The use of lastfield in the expression lastfield and "s" or "x" (equivalent to C’s lastfield?'s':'x') saves an if/else, but it’s unclear whether the saving is worth it. "sx"[not lastfield] and other similar alternatives are roughly equivalent in this respect; see Recipe 17.6. When lastfield is false, applying struct.unpack to just a prefix of theline (specifically, theline[:struct.calcsize(format)]) is an alternative, but it’s not easy to merge with the case of lastfield being true, when the format does need a supplementary field for len(theline)-struct.calcsize(format).
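The sketch below illustrates two points from this discussion, assuming Python 3 and bytes input. The helper names fields_prefix and last_code_variants are hypothetical. The first function shows the prefix-only alternative for the case where the last field is skipped; the second places the different spellings of the "s"-or-"x" choice side by side, including the conditional expression that Python gained in version 2.5.

```python
import struct

def fields_prefix(baseformat, theline):
    # Hypothetical helper: unpack only the fixed-width prefix and
    # ignore whatever follows it (the lastfield-is-false case)
    size = struct.calcsize(baseformat)
    return struct.unpack(baseformat, theline[:size])

def last_code_variants(lastfield):
    # Three spellings of "choose 's' when lastfield is true, else 'x'"
    return (
        lastfield and "s" or "x",      # and/or idiom used in this recipe
        "sx"[not lastfield],           # indexing by a boolean
        "s" if lastfield else "x",     # conditional expression (Python 2.5+)
    )

record = b"ABCDE---field001field002TAIL!!"   # made-up sample record
print(fields_prefix("5s 3x 8s 8s", record))
print(last_code_variants(None), last_code_variants(True))
```

The prefix version needs no computed field at all, which is why it is attractive when the tail is discarded, and also why it does not generalize to the case where the tail must be kept.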
Recipe 17.6 and Recipe 17.8; Perl Cookbook Recipe 1.1.