Chapter 5. Strings

The next major built-in type is the Python string—an ordered collection of characters, used to store and represent text-based information. From a functional perspective, strings can be used to represent just about anything that can be encoded as text: symbols and words (e.g., your name), contents of text files loaded into memory, Internet addresses, Python programs, and so on.

You may have used strings in other languages too; Python’s strings serve the same role as character arrays in languages such as C, but they are a somewhat higher-level tool than arrays. Unlike C, Python strings come with a powerful set of processing tools. Also unlike languages such as C, Python has no special type for single characters (like C’s char), only one-character strings.

Strictly speaking, Python strings are categorized as immutable sequences—meaning that they have a left-to-right positional order (sequence), and cannot be changed in place (immutable). In fact, strings are the first representative of the larger class of objects called sequences. Pay special attention to the operations introduced here, because they will work the same on other sequence types we’ll see later, such as lists and tuples.

Table 5-1 introduces common string literals and operations. Empty strings are written as two quotes with nothing in between, and there are a variety of ways to code strings. For processing, strings support expression operations such as concatenation (combining strings), slicing (extracting sections), indexing (fetching by offset), and so on. Besides expressions, Python also provides a set of string methods that implement common string-specific tasks, as well as a string module that mirrors most string methods.

Table 5-1. Common string literals and operations

Operation                                            Interpretation

s1 = ''                                              Empty string
s2 = "spam's"                                        Double quotes
block = """..."""                                    Triple-quoted blocks
s3 = r'\temp\spam'                                   Raw strings
s4 = u'spam'                                         Unicode strings
s1 + s2, s2 * 3                                      Concatenate, repeat
s2[i], s2[i:j], len(s2)                              Index, slice, length
"a %s parrot" % 'dead'                               String formatting
s2.find('pa'), s2.replace('pa', 'xx'), s1.split()    String method calls
for x in s2, 'm' in s2                               Iteration, membership

Methods and modules are discussed later in this section. Beyond the core set of string tools, Python also supports more advanced pattern-based string processing with the standard library’s re (regular expression) module, introduced in Chapter 27. This section starts with an overview of string literal forms and basic string expressions, and then looks at more advanced tools such as string methods and formatting.

String Literals

By and large, strings are fairly easy to use in Python. Perhaps the most complicated thing about them is that there are so many ways to write them in your code:

  • Single quotes: 'spa"m'

  • Double quotes: "spa'm"

  • Triple quotes: '''... spam ...''', """... spam ..."""

  • Escape sequences: "s\tp\na\0m"

  • Raw strings: r"C:\new\test.spm"

  • Unicode strings: u'eggs\u0020spam'

The single- and double-quoted forms are by far the most common; the others serve specialized roles. Let’s take a quick look at each of these options.

Single- and Double-Quoted Strings Are the Same

Around Python strings, single and double quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes—the two forms work the same, and return the same type of object. For example, the following two strings are identical, once coded:

>>> 'shrubbery', "shrubbery"
('shrubbery', 'shrubbery')

The reason for including both is that it allows you to embed a quote character of the other variety inside a string, without escaping it with a backslash: you may embed a single quote character in a string enclosed in double quote characters, and vice-versa:

>>> 'knight"s', "knight's"
('knight"s', "knight's")

Incidentally, Python automatically concatenates adjacent string literals, although it is almost as simple to add a + operator between them, to invoke concatenation explicitly.

>>> title = "Meaning " 'of' " Life"
>>> title
'Meaning of Life'

Notice in all of these outputs that Python prefers to print strings in single quotes, unless they embed one. You can also embed quotes by escaping them with backslashes:

>>> 'knight\'s', "knight\"s"
("knight's", 'knight"s')

But to understand why, we need to explain how escapes work in general.

Escape Sequences Code Special Bytes

The last example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings, known as escape sequences.

Escape sequences let us embed byte codes in strings that cannot be easily typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, here is a five-character string that embeds a newline and a tab:

>>> s = 'a\nb\tc'

The two characters \n stand for a single character: the byte containing the binary value of the newline character in your character set (usually, ASCII code 10). Similarly, the sequence \t is replaced with the tab character. The way this string looks when printed depends on how you print it. The interactive echo shows the special characters as escapes, but print interprets them instead:

>>> s
'a\nb\tc'
>>> print s
a
b       c

To be completely sure how many bytes are in this string, you can use the built-in len function—it returns the actual number of bytes in a string, regardless of how it is displayed.

>>> len(s)
5

This string is five bytes long: an ASCII “a” byte, a newline byte, an ASCII “b” byte, and so on; the original backslash characters are not really stored with the string in memory.
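As a quick check of the display forms described above, the following sketch compares the as-code form, the stored length, and the underlying character codes. It is written as a script rather than an interactive session, with print called as a function. The use of repr and ord is an illustrative choice, not something the surrounding text relies on.

```python
s = 'a\nb\tc'

# repr() gives the as-code form that the interactive echo shows;
# the escapes are displayed, not interpreted.
print(repr(s))      # 'a\nb\tc'

# len() counts stored characters: a, newline, b, tab, c.
print(len(s))       # 5

# ord() reveals the numeric code behind each character.
codes = [ord(ch) for ch in s]
print(codes)        # [97, 10, 98, 9, 99]
```

The codes 10 and 9 are the newline and tab characters; no backslash (code 92) appears anywhere in the stored string.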

For coding such special bytes, Python recognizes a full set of escape code sequences, listed in Table 5-2. Some sequences allow you to embed absolute binary values into the bytes of a string. For instance, here’s another five-character string that embeds two binary zero bytes:

>>> s = 'a\0b\0c'
>>> s
'a\x00b\x00c'
>>> len(s)
5

Table 5-2. String backslash characters

Escape        Meaning

\newline      Ignored (continuation)
\\            Backslash (keeps a \)
\'            Single quote (keeps ')
\"            Double quote (keeps ")
\a            Bell
\b            Backspace
\f            Formfeed
\n            Newline (linefeed)
\r            Carriage return
\t            Horizontal tab
\v            Vertical tab
\N{id}        Unicode database ID
\uhhhh        Unicode 16-bit hex
\Uhhhh...     Unicode 32-bit hex[1]
\xhh          Hex digits value hh
\ooo          Octal digits value
\0            Null (doesn't end string)
\other        Not an escape (kept)

[1] The \Uhhhh... escape sequence takes exactly eight hexadecimal digits (h); both \u and \U can be used only in Unicode string literals.

In Python, the zero (null) byte does not terminate a string the way it typically does in C. Instead Python keeps both the string’s length and text in memory. In fact, no character terminates a string in Python; here’s one that is all absolute binary escape codes—a binary 1 and 2 (coded in octal), followed by a binary 3 (coded in hexadecimal):

>>> s = '\001\002\x03'
>>> s
'\x01\x02\x03'
>>> len(s)
3

This becomes more important to know when you process binary data files in Python. Because their contents are represented as strings in your scripts, it’s okay to process binary files that contain any sort of binary byte values. More on files in Chapter 7.[2]
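A short sketch of the point about absolute binary values, assuming that counting characters and inspecting their codes is enough to show it: a zero character is stored like any other, and each octal or hex escape yields exactly one character.

```python
s = 'a\0b\0c'                 # two embedded null characters
print(len(s))                 # 5
print(s.split('\0'))          # ['a', 'b', 'c']

t = '\001\002\x03'            # octal, octal, and hex escapes
codes = [ord(ch) for ch in t]
print(codes)                  # [1, 2, 3]
```

Because the null character is ordinary data, it can even serve as a delimiter, as the split call shows.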

Finally, as the last entry in Table 5-2 implies, if Python does not recognize the character after a "\" as being a valid escape code, it simply keeps the backslash in the resulting string:

>>> x = "C:\py\code"     # keeps \ literally
>>> x
'C:\\py\\code'
>>> len(x)
10

Unless you’re able to commit all of Table 5-2 to memory, you probably shouldn’t rely on this behavior; to code literal backslashes, double up (“\\” is an escape for “\”), or use raw strings, described in the next section.

Raw Strings Suppress Escapes

As we’ve seen, escape sequences are handy for embedding special byte codes within strings. Sometimes, though, the special treatment of backslashes for introducing escapes can lead to trouble. It’s surprisingly common, for instance, to see Python newcomers in classes trying to open a file with a filename argument that looks something like this:

myfile = open('C:\new\text.dat', 'w')

thinking that they will open a file called text.dat in directory C:\new. The problem here is that \n is taken to stand for a newline character, and \t is replaced with a tab. In effect, the call tries to open a file named C:(newline)ew(tab)ext.dat, with usually less than stellar results.

This is just the sort of thing that raw strings are useful for. If the letter “r” (uppercase or lowercase) appears just before the opening quote of a string, it turns off the escape mechanism—Python retains your backslashes literally, exactly as you typed them. To fix the filename problem, just remember to add the letter “r” on Windows:

myfile = open(r'C:\new\text.dat', 'w')

Because two backslashes are really an escape sequence for one backslash, you can also keep your backslashes by simply doubling-up, without using raw strings:

myfile = open('C:\\new\\text.dat', 'w')

In fact, Python itself sometimes uses this doubled scheme when it prints strings with embedded backslashes:

>>> path = r'C:\new\text.dat'
>>> path                          # Show as Python code.
'C:\\new\\text.dat'
>>> print path                    # User-friendly format
C:\new\text.dat
>>> len(path)                     # String length
15

There really is just one backslash in the string where Python printed two in the first output of this code. As with numeric representation, the default format at the interactive prompt prints results as if they were code, but the print statement provides a more user-friendly format. To verify, check the result of the built-in len function again, to see the number of bytes in the string, independent of display formats. If you count, you’ll see that there really is just one character per backslash for a total of 15.

Besides directory paths on Windows, raw strings are also commonly used for regular expressions (text pattern matching, supported with module re); you’ll meet this feature later in this book. Also note that Python scripts can usually use forward slashes in directory paths on both Windows and Unix, because Python tries to interpret paths portably. Raw strings are useful if you code paths using native Windows backslashes.
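The three ways of coding the same Windows path can be compared side by side. This is a minimal sketch in script form; the variable names are illustrative, and no file is actually opened.

```python
raw = r'C:\new\text.dat'         # raw string: backslashes kept as typed
doubled = 'C:\\new\\text.dat'    # each \\ is an escape for one backslash
broken = 'C:\new\text.dat'       # \n and \t become newline and tab

print(raw == doubled)            # True
print(len(raw))                  # 15
print(len(broken))               # 13

# Converting to forward slashes gives the portable form.
print(raw.replace('\\', '/'))    # C:/new/text.dat
```

The broken form is two characters shorter because each two-character escape collapsed into a single newline or tab.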

Triple Quotes Code Multiline Block Strings

So far, you’ve seen single quotes, double quotes, escapes, and raw strings. Python also has a triple-quoted string literal format, sometimes called a block string, which is a syntactic convenience for coding multiline text data. This form begins with three quotes (of either the single or double variety), is followed by any number of lines of text, and is closed with the same triple quote sequence that opened it. Single and double quotes in the text may be, but do not have to be, escaped. For example:

>>> mantra = """Always look
...  on the bright
... side of life."""
>>>
>>> mantra
'Always look\n on the bright\nside of life.'

This string spans three lines (in some interfaces, the interactive prompt changes to “...” on continuation lines; IDLE simply drops down one line). Python collects all the triple-quoted text into a single multiline string, with embedded newline characters (\n) at the places that your code has line breaks. Notice that the second line in the result has a leading space as it did in the literal; what you type is truly what you get.

Triple-quoted strings are handy any time you need multiline text in your program, for example, to code error messages or HTML and XML code. You can embed such blocks directly in your script, without resorting to external text files or explicit concatenation and newline characters.
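To confirm that a block string is an ordinary string with embedded newlines, the sketch below rebuilds the earlier mantra in a script and splits it back into lines. Splitting on '\n' is just one illustrative way to inspect the result.

```python
mantra = """Always look
 on the bright
side of life."""

# The block is one string; line breaks are stored as \n characters.
print(repr(mantra))   # 'Always look\n on the bright\nside of life.'

lines = mantra.split('\n')
print(len(lines))     # 3
print(repr(lines[1])) # ' on the bright' (leading space preserved)
```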

Unicode Strings Encode Larger Character Sets

The last way to write strings in your scripts is perhaps the most specialized, and the least commonly used. Unicode strings are sometimes called “wide” character strings. Because each character may be represented with more than one byte in memory, Unicode strings allow programs to encode richer character sets than standard strings.

Unicode strings are typically used to support internationalization of applications (sometimes referred to as “i18n”, to compress the 18 characters between the first and last characters of the term). For instance, they allow programmers to directly support European or Asian character sets in Python scripts. Because such character sets have more characters than a single byte can represent, Unicode is normally used to process these forms of text.

In Python, Unicode strings may be coded in your script by adding the letter “U” (lower or uppercase), just before the opening quote of a string:

>>> u'spam'
u'spam'

Technically, this syntax generates a Unicode string object, which is a different data type than normal strings. However, Python allows you to freely mix Unicode and normal strings in expressions, and converts up to Unicode for mixed-type results (more on + concatenation in the next section):

>>> 'ni' + u'spam'        # Mixed string types
u'nispam'

In fact, Unicode strings are defined to support all the usual string processing operations you’ll meet in the next section, so the difference in types is often trivial to your code. Like normal strings, Unicode may be concatenated, indexed, sliced, matched with the re module, and so on, and cannot be changed in place. If you ever do need to convert between the two types explicitly, you can use the built-in str and unicode functions:

>>> str(u'spam')          # Unicode to normal
'spam'
>>> unicode('spam')       # Normal to unicode
u'spam'

Because Unicode is designed to handle multibyte characters, you can also use the special \u and \U escapes to encode binary character values that are larger than 8 bits:

>>> u'ab\x20cd'           # 8-bit/1-byte characters
u'ab cd'
>>> u'ab\u0020cd'         # 2-byte characters
u'ab cd'
>>> u'ab\U00000020cd'     # 4-byte characters
u'ab cd'

The first of these embeds the binary code for a space character; its binary value in hexadecimal notation is x20. The second and third do the same, but give the value in 2-byte and 4-byte Unicode escape notation.

Even if you don’t think you will need Unicode strings, you might use them without knowing it. Because some programming interfaces (e.g., the COM API on Windows) represent text as Unicode, it may find its way into your script as API inputs or results, and you may sometimes need to convert back and forth between normal and Unicode types. Since Python treats the two string types interchangeably in most contexts, the presence of Unicode strings is often transparent to your code: you can largely ignore the fact that text is being passed around as Unicode objects, and use normal string operations.

Unicode is a useful addition to Python; because it is built-in, it’s easy to handle such data in your scripts when needed. Unfortunately, from this point forward, the Unicode story becomes fairly complex. For example:

  • Unicode objects provide an encode method that converts a Unicode string into a normal 8-bit string using a specific encoding.

  • The built-in function unicode and module codecs support registered Unicode “codecs” (for “COders and DECoders”).

  • The module unicodedata provides access to the Unicode character database.

  • The sys module includes calls for fetching and setting the default Unicode encoding scheme (the default is usually ASCII).

  • You may combine the raw and unicode string formats (e.g., ur'a\b\c').

Because Unicode is a relatively advanced and rarely used tool, we will omit further details in this introductory text. See the Python standard manual for the rest of the Unicode story.
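For readers who want a first taste of the tools just listed, here is a small hedged sketch of the encode method and the unicodedata module. The sample text and the choice of the UTF-8 encoding are assumptions made for illustration, and the output comments assume print is called as a function.

```python
import unicodedata

text = u'sp\xe4m'                    # third character is a-with-diaeresis
encoded = text.encode('utf-8')       # Unicode text to an 8-bit string
decoded = encoded.decode('utf-8')    # and back to Unicode text

print(len(text))                     # 4 characters
print(len(encoded))                  # 5 bytes: one character needs two
print(decoded == text)               # True

# The character database maps a character to its official name.
print(unicodedata.name(text[2]))     # LATIN SMALL LETTER A WITH DIAERESIS
```

The length difference is the practical meaning of “more than one byte per character”: the text has four characters, but its encoded form occupies five bytes.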

Strings in Action

Once you’ve written a string, you will almost certainly want to do things with it. This section and the next two demonstrate string basics, formatting, and methods.

Basic Operations

Let’s begin by interacting with the Python interpreter to illustrate the basic string operations listed in Table 5-1. Strings can be concatenated using the + operator, and repeated using the * operator:

% python
>>> len('abc')         # Length: number of items
3
>>> 'abc' + 'def'      # Concatenation: a new string
'abcdef'
>>> 'Ni!' * 4          # Repetition: like "Ni!" + "Ni!" + ...
'Ni!Ni!Ni!Ni!'

Formally, adding two string objects creates a new string object, with the contents of its operands joined; repetition is like adding a string to itself a number of times. In both cases, Python lets you create arbitrarily sized strings; there’s no need to predeclare anything in Python, including the sizes of data structures.[3] The len built-in function returns the length of strings (and other objects with a length).

Repetition may seem a bit obscure at first, but it comes in handy in a surprising number of contexts. For example, to print a line of 80 dashes, you can either count up to 80 or let Python count for you:

>>> print '------- ...more... ---'      # 80 dashes, the hard way
>>> print '-'*80                        # 80 dashes, the easy way

Notice that operator overloading is at work here already: we’re using the same + and * operators that are called addition and multiplication when using numbers. Python does the correct operation, because it knows the types of objects being added and multiplied. But be careful: this isn’t quite as liberal as you might expect. For instance, Python doesn’t allow you to mix numbers and strings in + expressions: 'abc'+9 raises an error, instead of automatically converting 9 to a string.
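The restriction on mixing types can be seen directly by trapping the error. The sketch below is in script form; the try statement is used only to keep the script running and is covered later in the book, and the exact error text varies between Python versions, so it is printed rather than quoted here.

```python
try:
    result = 'abc' + 9            # no automatic conversion of 9
except TypeError as exc:
    result = None
    print('TypeError: %s' % exc)

# Converting explicitly removes the ambiguity.
print('abc' + str(9))             # abc9

# * with a string and an integer is allowed: it means repetition.
print('abc' * 3)                  # abcabcabc
print(len('-' * 80))              # 80
```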

As shown in the last line in Table 5-1, you can also iterate over strings in loops using for statements and test membership with the in expression operator, which is essentially a search:

>>> myjob = "hacker"
>>> for c in myjob: print c,       # Step through items.
...
h a c k e r
>>> "k" in myjob                   # 1 means true (found).
1
>>> "z" in myjob                   # 0 means false (not found).
0

The for loop assigns a variable to successive items in a sequence (here, a string), and executes one or more statements for each item. In effect, the variable c becomes a cursor stepping across the string here. But further details on these examples will be discussed later.
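The same loop and membership tests can be written as a script. In this sketch the collected letters list and the join call are illustrative additions. On Python releases that have a Boolean type, the membership tests display True and False rather than 1 and 0; the two forms compare equal.

```python
myjob = "hacker"

# Step through the string one character at a time.
letters = []
for c in myjob:
    letters.append(c)
print(' '.join(letters))      # h a c k e r

# Membership is a search for a matching item.
found = 'k' in myjob
missing = 'z' in myjob
print(found == 1)             # True
print(missing == 0)           # True
```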

Indexing and Slicing

Because strings are defined as an ordered collection of characters, we can access their components by position. In Python, characters in a string are fetched by indexing, that is, by providing the numeric offset of the desired component in square brackets after the string. You get back the one-character string at that offset.

As in the C language, Python offsets start at zero and end at one less than the length of the string. Unlike C, Python also lets you fetch items from sequences such as strings using negative offsets. Technically, negative offsets are added to the length of a string to derive a positive offset. You can also think of negative offsets as counting backwards from the end.

>>> S = 'spam'
>>> S[0], S[-2]               # Indexing from front or end
('s', 'a')
>>> S[1:3], S[1:], S[:-1]     # Slicing: extract section
('pa', 'pam', 'spa')

The first line defines a four-character string and assigns it the name S. The next line indexes it two ways: S[0] fetches the item at offset 0 from the left (the one-character string 's'), and S[-2] gets the item at offset 2 from the end (or equivalently, at offset (4 + -2) from the front). Offsets and slices map to cells as shown in Figure 5-1.[4]

Figure 5-1. Using offsets and slices

The last line in the example above is our first look at slicing. Probably the best way to think of slicing is that it is a form of parsing (analyzing structure), especially when applied to strings—it allows us to extract an entire section (substring) in a single step. Slices can extract columns of data, chop off leading and trailing text, and more.

Here’s how slicing works. When you index a sequence object such as a string on a pair of offsets separated by a colon, Python returns a new object containing the contiguous section identified by the pair of offsets. The left offset is taken to be the lower bound (inclusive), and the right is the upper bound (noninclusive). Python fetches all items from the lower bound, up to but not including the upper bound, and returns a new object containing the fetched items. If omitted, the left and right bounds default to zero and the length of the object you are slicing, respectively.

For instance, in the example above, S[1:3] extracts items at offsets 1 and 2. It grabs the second and third items, and stops before the fourth item at offset 3. Next S[1:] gets all items past the first—the upper bound defaults to the length of the string. Finally, S[:-1] fetches all but the last item—the lower bound defaults to zero, and -1 refers to the last item, non-inclusive.

This may seem confusing at first glance, but indexing and slicing are simple and powerful to use, once you get the knack. Remember, if you’re unsure about what a slice means, try it out interactively. In the next chapter, you’ll see that it’s also possible to change an entire section of a certain object in one step, by assigning to a slice. Here’s a summary of the details for reference:

Indexing (S[i]) fetches components at offsets

  • The first item is at offset 0.

  • Negative indexes mean to count backwards from the end or right.

  • S[0] fetches the first item.

  • S[-2] fetches the second from the end (like S[len(S)-2]).

Slicing (S[i:j]) extracts contiguous sections of a sequence

  • The upper bound is noninclusive.

  • Slice boundaries default to 0 and the sequence length, if omitted.

  • S[1:3] fetches from offsets 1 up to, but not including, 3.

  • S[1:] fetches from offset 1 through the end (length).

  • S[:3] fetches from offset 0 up to, but not including, 3.

  • S[:-1] fetches from offset 0 up to, but not including, the last item.

  • S[:] fetches from offsets 0 through the end—a top-level copy of S.
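The summary above can be checked with a short script. The sketch below walks through each rule on the string 'spam', then applies slicing as a simple parsing tool. The fixed-width line of text is a hypothetical record made up for illustration.

```python
S = 'spam'

# Indexing: positive offsets count from the left, negative from the right.
first = S[0]                      # 's'
second_from_end = S[-2]           # 'a', same as S[len(S) - 2]

# Slicing: lower bound inclusive, upper bound noninclusive.
middle = S[1:3]                   # 'pa'
all_but_first = S[1:]             # 'pam'
all_but_last = S[:-1]             # 'spa'
whole = S[:]                      # 'spam', an equal string

# Slicing as parsing: pull fixed-width columns out of a line of text.
line = 'aaa bbb ccc'              # hypothetical three-column record
col1, col3 = line[0:3], line[8:]
print(col1)                       # aaa
print(col3)                       # ccc
```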

We’ll see another slicing-as-parsing example later in this section. The last item listed here turns out to be a very common trick: it makes a full top-level copy of a sequence object—an object with the same value, but a distinct piece of memory. This isn’t very useful for immutable objects like strings, but comes in handy for objects that may be changed, such as lists (more on copies in Chapter 7). Later, we’ll also see that the syntax used to index by offset (the square brackets) is used to index dictionaries by key as well; the operations look the same, but have different interpretations.

In Python 2.3, slice expressions support an optional third index, used as a step (sometimes called a stride). The step is added to the index of each item extracted. For instance, X[1:10:2] will fetch every other item in X from offsets 1 through 9; it will collect items from offsets 1, 3, 5, and so on. Similarly, the slicing expression "hello"[::-1] returns the new string "olleh". For more details, see Python’s standard documentation, or run a few experiments interactively.
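A few experiments of the kind suggested above, written as a script. The sample string X is an arbitrary choice for illustration.

```python
X = 'abcdefghijklmnop'

# Offsets 1, 3, 5, 7, 9: every other item, stopping before offset 10.
print(X[1:10:2])        # bdfhj

# Omitted bounds still default to the two ends of the string.
print(X[::2])           # acegikmo

# A negative step walks from right to left, reversing the string.
print('hello'[::-1])    # olleh
```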

String Conversion Tools

You cannot add a number and a string together in Python, even if the string looks like a number (i.e., is all digits):

>>> "42" + 1
TypeError: cannot concatenate 'str' and 'int' objects

This is by design: because + can mean both addition and concatenation, the choice of conversion would be ambiguous. So, Python treats this as an error. In Python, magic is generally omitted, if it would make your life more complex.

What to do, then, if your script obtains a number as a text string from a file or user interface? The trick is that you need to employ conversion tools before you can treat a string like a number, or vice versa. For instance:

>>> int("42"), str(42)         # Convert from/to string.
(42, '42')
>>> import string              # Needed only for the older string.atoi
>>> string.atoi("42"), `42`    # Same, but older techniques
(42, '42')

The int and string.atoi functions both convert a string to a number, and the str function and backquotes around any object convert that object to its string representation (e.g., `42` converts a number to a string). Of these, int and str are the newer, and generally prescribed conversion techniques, and do not require importing the string module.

Although you can’t mix strings and number types around operators such as +, you can manually convert before that operation if needed:

>>> int("42") + 1            # Force addition.
43
>>> "spam" + str(42)         # Force concatenation.
'spam42'

Similar built-in functions handle floating-point number conversions:

>>> str(3.1415), float("1.5")
('3.1415', 1.5)

>>> text = "1.234E-10"
>>> float(text)
1.2340000000000001e-010

Later, we’ll further study the built-in eval function; it runs a string containing Python expression code, and so can convert a string to any kind of object. The functions int, string.atoi, and their relatives convert only to numbers, but this restriction means they are usually faster. As seen in Chapter 4, the string formatting expression provides another way to convert numbers to strings.
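In practice, text read from a file or a user interface may hold either an integer or a floating-point value. The helper below is a hypothetical convenience function, not part of Python: a sketch that tries int first and falls back to float, using only the built-in conversions described above.

```python
def to_number(text):
    """Convert text to an int if possible, else to a float.

    Hypothetical helper for illustration; raises ValueError
    if the text is not numeric at all.
    """
    try:
        return int(text)
    except ValueError:
        return float(text)

print(to_number("42") + 1)        # 43
print(to_number("1.5") * 2)       # 3.0
print("spam" + str(42))           # spam42
print(repr(42))                   # 42; repr is the function form of backquotes
```

Trying int before float keeps whole numbers as integers, so "42" + 1 style arithmetic yields 43 rather than 43.0.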

Changing Strings

Remember the term immutable sequence? The immutable part means that you can’t change a string in place (e.g., by assigning to an index):

>>> S = 'spam'
>>> S[0] = "x"
Raises an error!
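In a script, the failed assignment can be trapped to confirm both the kind of error and the fact that the string is left untouched. This is a minimal sketch; the try statement is covered later in the book and is used here only to keep the script running.

```python
S = 'spam'
try:
    S[0] = 'x'                    # in-place change is not allowed
    changed = True
except TypeError:
    changed = False
    print('strings cannot be changed in place')

print(S)                          # spam
```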

So how do you modify text information in Python? To change a string, you just need to build and assign a new string using tools such as concatenation and slicing, and possibly assigning the result back to the string’s original name.

>>> S = S + 'SPAM!'       # To change a string, make a new one.
>>> S
'spamSPAM!'
>>> S = S[:4] + 'Burger' + S[-1]
>>> S
'spamBurger!'

The first example adds a substring at the end of S, by concatenation; really, it makes a new string, and assigns it back to S, but you can usually think of this as changing a string. The second example replaces four characters with six by slicing, indexing, and concatenating. Later in this section, you’ll see how to achieve a similar effect with string method calls. Finally, it’s also possible to build up new text values with string formatting expressions:

>>> 'That is %d %s bird!' % (1, 'dead')    # like C sprintf
'That is 1 dead bird!'

The next section shows how.

String Formatting

Python overloads the % binary operator to work on strings (the % operator also means remainder-of-division modulus for numbers). When applied to strings, it serves the same role as C’s sprintf function; the % provides a simple way to format values as strings, according to a format definition string. In short, this operator provides a compact way to code multiple string substitutions.

To format strings:

  1. Provide a format string on the left of the % operator with embedded conversion targets that start with a % (e.g., "%d").

  2. Provide an object (or objects in parentheses) on the right of the % operator that you want Python to insert into the format string on the left at its conversion targets.

For instance, in the last example of the prior section, the integer 1 replaces the %d in the format string on the left, and the string 'dead' replaces the %s. The result is a new string that reflects these two substitutions.

Technically speaking, the string formatting expression is usually optional—you can generally do similar work with multiple concatenations and conversions. However, formatting allows us to combine many steps into a single operation. It’s powerful enough to warrant a few more examples:

>>> exclamation = "Ni"
>>> "The knights who say %s!" % exclamation
'The knights who say Ni!'

>>> "%d %s %d you" % (1, 'spam', 4)
'1 spam 4 you'

>>> "%s -- %s -- %s" % (42, 3.14159, [1, 2, 3])
'42 -- 3.14159 -- [1, 2, 3]'

The first example here plugs the string "Ni" into the target on the left, replacing the %s marker. In the second, three values are inserted into the target string. When there is more than one value being inserted, you need to group the values on the right in parentheses (which really means they are put in a tuple). Keep in mind that formatting always makes a new string, rather than changing the string on the left; since strings are immutable, it must.

Notice that the third example inserts three values again—an integer, floating-point, and list object—but all of the targets on the left are %s, which stands for conversion to string. Since every type of object can be converted to a string (the one used when printing), every object works with the %s conversion code. Because of this, unless you will be doing some special formatting, %s is often the only code you need to remember.
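To make the “formatting is optional” point concrete, the sketch below builds the same text twice, once with a format expression and once with explicit conversions and concatenation. It also checks that the format string itself is not changed. The variable names are illustrative.

```python
values = (42, 3.14159, [1, 2, 3])

# One formatting step...
formatted = "%s -- %s -- %s" % values

# ...versus three conversions and four concatenations.
manual = str(42) + " -- " + str(3.14159) + " -- " + str([1, 2, 3])

print(formatted)                  # 42 -- 3.14159 -- [1, 2, 3]
print(formatted == manual)        # True

# Formatting returns a new string; the template is left as it was.
template = "The knights who say %s!"
result = template % "Ni"
print(result)                     # The knights who say Ni!
print(template)                   # The knights who say %s!
```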

Advanced String Formatting

For more advanced type-specific formatting, you can use any of the conversion codes listed in Table 5-3 in formatting expressions. C programmers will recognize most of these, because Python string formatting supports all the usual C printf format codes (but returns the result, instead of displaying it like printf). Some of the format codes in the table provide alternative ways to format the same type; for instance, %e, %f, and %g provide alternative ways to format floating-point numbers.

Table 5-3. String formatting codes

Code    Meaning

%s      String (or any object)
%r      Same as %s, but uses repr(), not str()
%c      Character
%d      Decimal (integer)
%i      Integer
%u      Unsigned (integer)
%o      Octal integer
%x      Hex integer
%X      Same as %x, but prints uppercase
%e      Floating-point exponent
%E      Same as %e, but prints uppercase
%f      Floating-point decimal
%g      Floating-point e or f
%G      Floating-point E or f
%%      Literal '%'

In fact, conversion targets in the format string on the expression’s left side support a variety of conversion operations, with a fairly sophisticated syntax all their own. The general structure of conversion targets looks like this:

%[(name)][flags][width][.precision]code

The character codes in Table 5-3 show up at the end of the target string. Between the % and the character code, we can give a dictionary key; list flags that specify things like left justification (-), numeric sign (+), and zero fills (0); give total field width and the number of digits after a decimal point; and more.

Formatting target syntax is documented in full elsewhere, but to demonstrate commonly-used format syntax, here are a few examples. The first formats integers by default, and then in a six-character field with left justification and zero padding:

>>> x = 1234
>>> res = "integers: ...%d...%-6d...%06d" % (x, x, x)
>>> res
'integers: ...1234...1234  ...001234'

The %e, %f, and %g formats display floating-point numbers in different ways, such as:

>>> x = 1.23456789
>>> x
1.2345678899999999

>>> '%e | %f | %g' % (x, x, x)
'1.234568e+000 | 1.234568 | 1.23457'

For floating-point numbers, we can achieve a variety of additional formatting effects by specifying left justification, zero padding, numeric signs, field width, and digits after the decimal point. For simpler tasks, you might get by with simply converting to strings with a format expression or the str built-in function shown earlier:

>>> '%-6.2f | %05.2f | %+06.1f' % (x, x, x)
'1.23   | 01.23 | +001.2'

>>> "%s" % x, str(x)
('1.23456789', '1.23456789')
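The flags, width, and precision fields are easiest to learn by varying one at a time. The sketch below applies several %f targets to the same number and brackets each result so that padding is visible. The list of targets is an arbitrary selection, and %e is left out because its exponent width can differ across platforms.

```python
x = 1.23456789
targets = ['%f', '%.2f', '%8.3f', '%-8.3f', '%+.1f', '%08.3f']

results = {}
for target in targets:
    results[target] = target % x
    print('%-7s gives [%s]' % (target, results[target]))

# Expected display:
# %f      gives [1.234568]
# %.2f    gives [1.23]
# %8.3f   gives [   1.235]
# %-8.3f  gives [1.235   ]
# %+.1f   gives [+1.2]
# %08.3f  gives [0001.235]
```

Note that the report line is itself a format expression: %-7s left-justifies each target in a seven-character field.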

String formatting also allows conversion targets on the left to refer to the keys in a dictionary on the right, to fetch the corresponding value. We haven’t told you much about dictionaries yet, but here are the basics for future reference:

>>> "%(n)d %(x)s" % {"n":1, "x":"spam"}
'1 spam'

Here, the (n) and (x) in the format string refer to keys in the dictionary literal on the right, and fetch their associated values. This trick is often used in conjunction with the vars built-in function, which returns a dictionary containing all the variables that exist in the place it is called:

>>> food = 'spam'
>>> age = 40
>>> vars(  )
{'food': 'spam', 'age': 40, ...many more... }

When used on the right of a format operation, this allows the format string to refer to variables by name (i.e., by dictionary key):

>>> "%(age)d %(food)s" % vars(  )
'40 spam'
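The same technique works with any dictionary, not just the one returned by vars. The following sketch fills in a small message from an explicit dictionary; the template text and the record are hypothetical, made up for this example.

```python
# Hypothetical template and record, invented for illustration.
template = 'Dear %(name)s, your order of %(count)d %(item)s has shipped.'
record = {'name': 'Bob', 'count': 3, 'item': 'eggs'}

letter = template % record      # Keys in parentheses select dictionary values
print(letter)                   # Dear Bob, your order of 3 eggs has shipped.
```

Passing an explicit dictionary, rather than vars, keeps the set of names the format string can see small and obvious.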

We’ll say much more about dictionaries in Chapter 6. See also Section 4.3 for examples that convert to hexadecimal and octal strings with the %x and %o formatting target codes.

String Methods

In addition to expression operators, strings provide a set of methods that implement more sophisticated text processing tasks. Methods are simply functions that are associated with a particular object. Technically, they are attributes attached to objects, which happen to reference a callable function. In Python, methods are specific to object types; string methods, for example, only work on string objects.

Functions are packages of code, and method calls combine two operations at once—an attribute fetch, and a call:

Attribute fetches

An expression of the form object.attribute means “fetch the value of attribute in object.”

Call expressions

An expression of the form function(arguments) means “invoke the code of function, passing zero or more comma-separated argument objects to it, and returning the function’s result value.”

Putting these two together allows us to call a method of an object. The method call expression object.method(arguments) is evaluated from left to right—Python will first fetch the method of the object, and then call it, passing in the arguments. If the method computes a result, it will come back as the result of the entire method call expression.
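The two steps can be separated to make the left-to-right evaluation visible. This is only a sketch for illustration; in normal code you would write the call in one expression.

```python
S = 'spam'

# One step: fetch the method and call it in a single expression
result1 = S.upper()

# Two steps: fetch the method first, then call it
fetch = S.upper          # Attribute fetch: nothing is called yet
result2 = fetch()        # Call expression: runs the method on S

print(result1)           # SPAM
print(result2)           # SPAM
```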

As you’ll see throughout Part II, most objects have callable methods, and all are accessed using this same method call syntax. To call an object method, you have to go through an existing object; let’s move on to some examples to see how.

String Method Examples: Changing Strings

Table 5-4 summarizes the call patterns for built-in string methods. They implement higher-level operations, like splitting and joining, case conversions and tests, and substring searches. Let’s work through some code that demonstrates some of the most commonly used methods in action, and presents Python text-processing basics along the way.

Table 5-4. String method calls
S.capitalize(  )
S.ljust(width)
S.center(width)
S.lower(  )
S.count(sub [, start [, end]])
S.lstrip(  )
S.encode([encoding [,errors]])
S.replace(old, new [, count])
S.endswith(suffix [, start [, end]])
S.rfind(sub [,start [,end]])
S.expandtabs([tabsize])
S.rindex(sub [, start [, end]])
S.find(sub [, start [, end]])
S.rjust(width)
S.index(sub [, start [, end]])
S.rstrip(  )
S.isalnum(  )
S.split([sep [,maxsplit]])
S.isalpha(  )
S.splitlines([keepends])
S.isdigit(  )
S.startswith(prefix [, start [, end]])
S.islower(  )
S.strip(  )
S.isspace(  )
S.swapcase(  )
S.istitle(  )
S.title(  )
S.isupper(  )
S.translate(table [, delchars])
S.join(seq)
S.upper(  )

Because strings are immutable, they cannot be changed in-place directly. To make a new text value, you can always construct a new string with operations such as slicing and concatenating. For example, to replace two characters in the middle of a string:

>>> S = 'spammy'
>>> S = S[:3] + 'xx' + S[5:]
>>> S
'spaxxy'

But if you’re really just out to replace a substring, you can use the string replace method instead:

>>> S = 'spammy'
>>> S = S.replace('mm', 'xx')
>>> S
'spaxxy'

The replace method is more general than this code implies. It takes as arguments the original substring (of any length), and the string (of any length) to replace it with, and performs a global search-and-replace:

>>> 'aa$bb$cc$dd'.replace('$', 'SPAM')
'aaSPAMbbSPAMccSPAMdd'

In such roles, replace can be used to implement template replacement tools (e.g., form letters). Notice that this time we simply let the interactive prompt print the result instead of assigning it to a name; you need to assign results to names only if you want to retain them for later use. If you need to replace just one fixed-size string that can occur at any offset, you can either run a replacement again, or search for the substring with the string find method and then slice:

>>> S = 'xxxxSPAMxxxxSPAMxxxx'
>>> where = S.find('SPAM')          # Search for position
>>> where                           # Occurs at offset 4
4
>>> S = S[:where] + 'EGGS' + S[(where+4):]
>>> S
'xxxxEGGSxxxxSPAMxxxx'

The find method returns the offset where the substring appears (by default, searching from the front), or -1 if it is not found. Another way is to use replace with a third argument to limit it to a single substitution:

>>> S = 'xxxxSPAMxxxxSPAMxxxx'
>>> S.replace('SPAM', 'EGGS')           # Replace all
'xxxxEGGSxxxxEGGSxxxx'

>>> S.replace('SPAM', 'EGGS', 1)        # Replace one
'xxxxEGGSxxxxSPAMxxxx'

Notice that replace is returning a new string each time here. Because strings are immutable, methods never really change the subject string in-place, even if they are called “replace.”
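The find-and-slice approach shown earlier assumes that the substring is present. Because find returns -1 on failure, slicing with that result would quietly produce a wrong string, so it is safer to test for it. The helper below is a hypothetical function written for this sketch, not a built-in tool.

```python
def replace_first(text, old, new):
    """Return a copy of text with the first occurrence of old replaced by new."""
    where = text.find(old)
    if where == -1:               # Not found: return the text unchanged
        return text
    return text[:where] + new + text[where + len(old):]

S = 'xxxxSPAMxxxxSPAMxxxx'
print(replace_first(S, 'SPAM', 'EGGS'))    # xxxxEGGSxxxxSPAMxxxx
print(replace_first(S, 'HAM', 'EGGS'))     # xxxxSPAMxxxxSPAMxxxx (no match)
print(S)                                   # xxxxSPAMxxxxSPAMxxxx (unchanged)
```

For this particular job, S.replace('SPAM', 'EGGS', 1) gives the same result with less code; the manual version is mainly useful when you also need the offset for something else.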

In fact, one potential downside of using either concatenation or the replace method to change strings is that both generate new string objects each time they are run. If you have to apply many changes to a very large string, you might be able to improve your script’s performance by converting the string to an object that does support in-place changes:

>>> S = 'spammy'
>>> L = list(S)
>>> L
['s', 'p', 'a', 'm', 'm', 'y']

The built-in list function (or an object construction call) builds a new list out of the items in any sequence, in this case “exploding” the characters of a string into a list. Once in this form, you can make multiple changes without generating copies of the string for each change:

>>> L[3] = 'x'              # Works for lists, not strings
>>> L[4] = 'x'
>>> L
['s', 'p', 'a', 'x', 'x', 'y']

If, after your changes, you need to convert back to a string (e.g., to write to a file), use the string join method to “implode” the list back into a string:

>>> S = ''.join(L)
>>> S
'spaxxy'

The join method may look a bit backward at first sight. Because it is a method of strings (not the list of strings), it is called through the desired delimiter. join puts the list’s strings together, with the delimiter between list items; in this case, using an empty string delimiter to convert from list back to string. More generally, any string delimiter and strings list will do:

>>> 'SPAM'.join(['eggs', 'sausage', 'ham', 'toast'])
'eggsSPAMsausageSPAMhamSPAMtoast'
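One detail worth a quick sketch: join expects every item to be a string, so other objects must be converted first. The sample data below is hypothetical. The second half shows the common pattern of collecting pieces in a list and joining them once at the end.

```python
numbers = [1, 2, 3]          # Hypothetical data for illustration

# join accepts only strings, so convert each item with str first
pieces = [str(n) for n in numbers]
line = ', '.join(pieces)
print(line)                  # 1, 2, 3

# Accumulate many pieces in a list, then join once at the end
parts = []
for word in ['spam', 'eggs', 'ham']:
    parts.append(word.upper())
result = '-'.join(parts)
print(result)                # SPAM-EGGS-HAM
```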

String Method Examples: Parsing Text

Another common role for string methods is as a simple form of text parsing—analyzing structure and extracting substrings. To extract substrings at fixed offsets, we can employ slicing techniques:

>>> line = 'aaa bbb ccc'
>>> col1 = line[0:3]
>>> col3 = line[8:]
>>> col1
'aaa'
>>> col3
'ccc'

Here, the columns of data appear at fixed offsets, and so may be sliced out of the original string. This technique passes for parsing, as long as your data has fixed positions for its components. If the data is separated by some sort of delimiter instead, we can pull out its components by splitting, even if the data may show up at arbitrary positions within the string:

>>> line = 'aaa bbb   ccc'
>>> cols = line.split(  )
>>> cols
['aaa', 'bbb', 'ccc']

The string split method chops up a string into a list of substrings, around a delimiter string. We didn’t pass a delimiter in the prior example, so it defaults to whitespace—the string is split at groups of one or more spaces, tabs, and newlines, and we get back a list of the resulting substrings. In other applications, the data may be separated by more tangible delimiters, such as keywords or commas:

>>> line = 'bob,hacker,40'
>>> line.split(',')
['bob', 'hacker', '40']

This example splits (and hence parses) the string at commas, a separator common in data returned by some database tools. Delimiters can be longer than a single character too:

>>> line = "i'mSPAMaSPAMlumberjack"
>>> line.split("SPAM")
["i'm", 'a', 'lumberjack']

Although there are limits to the parsing potential of slicing and splitting, both run very fast, and can handle basic text extraction chores.
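Two practical points about split are easy to miss, and the following sketch illustrates both under simple assumptions. First, the pieces that come back are always strings, so numeric fields need an explicit conversion. Second, the whitespace default is not the same as passing a single space as the delimiter.

```python
line = 'bob,hacker,40'

fields = line.split(',')        # ['bob', 'hacker', '40']
name, job, age = fields         # Unpack the three substrings
age = int(age)                  # split returns strings; convert as needed
print(age + 1)                  # 41

spaced = 'aaa bbb   ccc'
by_default = spaced.split()     # Runs of whitespace count as one delimiter
by_space = spaced.split(' ')    # Each single space is a delimiter
print(by_default)               # ['aaa', 'bbb', 'ccc']
print(by_space)                 # ['aaa', 'bbb', '', '', 'ccc']
```

The empty strings in the last result mark the gaps between adjacent spaces, which is rarely what you want when parsing free-form text.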

You’ll meet additional string examples later in this book. For more details, also see the Python library manual and other documentation sources, or simply experiment with these interactively on your own. Note that none of the string methods accept patterns—for pattern-based text processing, you must use the Python re standard library module. Because of this limitation, though, string methods sometimes run more quickly than the re module’s tools.
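As a brief taste of the pattern-based alternative, here is a minimal sketch that uses the re module's split function to handle delimiters that vary. The input line and the pattern are assumptions chosen for illustration, not an example from elsewhere in this chapter.

```python
import re

line = 'aaa, bbb;ccc  ,  ddd'      # Hypothetical input with mixed delimiters

# Split at a comma or semicolon, absorbing any surrounding whitespace
fields = re.split(r'\s*[,;]\s*', line)
print(fields)                       # ['aaa', 'bbb', 'ccc', 'ddd']
```

A plain string split could handle only one fixed delimiter at a time here, and would leave the stray spaces attached to the fields.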

The Original string Module

Python’s string method story is somewhat convoluted by history. For roughly the first decade of Python’s existence, it provided a standard library module called string, which contained functions that largely mirror the current set of string object methods. Later, in Python 2.0 (and the short-lived 1.6), these functions were made available as methods of string objects, in response to user requests. Because so many people wrote so much code that relied on the original string module, it is retained for backward compatibility.

The upshot of this legacy is that today, there are usually two ways to invoke advanced string operations—by calling object methods, or calling string module functions and passing in the object as an argument. For instance, given a variable X assigned to a string object, calling an object method:

X.method(arguments)

is usually equivalent to calling the same operation through the module:

string.method(X, arguments)

provided that you have already imported the module string. Here’s an example of both call patterns in action—first, the method scheme:

>>> S = 'a+b+c+'
>>> x = S.replace('+', 'spam')
>>> x
'aspambspamcspam'

To access the same operation through the module, you need to import the module (at least once in your process), and pass in the object:

>>> import string
>>> y = string.replace(S, '+', 'spam')
>>> y
'aspambspamcspam'

Because the module approach was the standard for so long, and because strings are such a central component of most programs, you will probably see both call patterns in Python code you come across.

Today, though, the general recommendation is to use methods instead of the module. The module call scheme requires you to import the string module (methods do not). The string module makes calls a few characters longer to type (at least when you load the module with import, but not for from). In addition, the module may run more slowly than methods (the current module maps most calls back to the methods, and so incurs an extra call along the way).

On the other hand, because the overlap between module and method tools is not exact, you may still sometimes need to use either scheme—some methods are only available as methods, and some as module functions. In addition, some programmers prefer to use the module call pattern, because the module’s name makes it more obvious that code is calling string tools: string.method(x) seems more self-documenting than x.method( ) to some. As always, the choice should ultimately be yours to make.
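One illustration of the inexact overlap, offered here as an example rather than a complete list: the module's capwords function has no method of the same name, and the nearest method, title, treats apostrophes differently.

```python
import string

phrase = "i'm a lumberjack"

by_method = phrase.title()             # Capitalizes after any non-letter
by_module = string.capwords(phrase)    # Capitalizes whitespace-separated words

print(by_method)                       # I'M A Lumberjack
print(by_module)                       # I'm A Lumberjack
```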

General Type Categories

Now that we’ve seen the first collection object, the string, let’s pause to define a few general type concepts that will apply to most of the types from here on. In regard to built-in types, it turns out that operations work the same for all types in a category, so we only need to define most ideas once. We’ve only seen numbers and strings so far; but because they are representative of two of the three major type categories in Python, you already know more about other types than you think.

Types Share Operation Sets by Categories

Strings are immutable sequences: they cannot be changed in place (the immutable part), and are positionally-ordered collections that are accessed by offsets (the sequence part). Now, it so happens that all the sequences seen in this part of the book respond to the same sequence operations shown at work on strings—concatenation, indexing, iteration, and so on. More formally, there are three type (and operation) categories in Python:

Numbers

Support addition, multiplication, etc.

Sequences

Support indexing, slicing, concatenation, etc.

Mappings

Support indexing by key, etc.

We haven’t seen mappings yet (dictionaries are discussed in the next chapter), but other types are going to be mostly more of the same. For example, for any sequence objects X and Y:

  • X + Y makes a new sequence object with the contents of both operands.

  • X * N makes a new sequence object with N copies of the sequence operand X.

In other words, these operations work the same on any kind of sequence—strings, lists, tuples, and some user-defined object types. The only difference is that you get back a new result object that is the same type as the operands X and Y—if you concatenate lists, you get back a new list, not a string. Indexing, slicing, and other sequence operations work the same on all sequences too; the type of the objects being processed tells Python which task to perform.
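A short sketch makes the point concrete. The function below is hypothetical, written only for this illustration; it uses nothing but sequence operations, so it accepts strings, lists, and tuples alike and returns a result of the same type as its argument.

```python
def double_and_tail(seq):
    """Return seq repeated twice, followed by its last two items."""
    return seq * 2 + seq[-2:]

print(double_and_tail('spam'))        # spamspamam
print(double_and_tail([1, 2, 3]))     # [1, 2, 3, 1, 2, 3, 2, 3]
print(double_and_tail((1, 2, 3)))     # (1, 2, 3, 1, 2, 3, 2, 3)
```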

Mutable Types Can Be Changed in-Place

The immutable classification is an important constraint to know, yet it tends to trip up new users. If an object type is immutable, you cannot change its value in-place; Python raises an error if you try. Instead, run code to make a new object for a new value. Generally, immutable types give some degree of integrity, by guaranteeing that an object won’t be changed by another part of a program. You’ll see why this matters when shared object references are discussed in Chapter 7.
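The following sketch shows both halves of that rule for strings: an attempted in-place change fails with a TypeError, while building a new object succeeds and leaves the original alone. The variable names are arbitrary.

```python
S = 'spam'
try:
    S[0] = 'x'                   # In-place assignment is not allowed
    outcome = 'changed'
except TypeError:
    outcome = 'error'            # Strings raise TypeError here

T = 'x' + S[1:]                  # Build a new string instead
print(outcome)                   # error
print(S)                         # spam
print(T)                         # xpam
```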



[2] But if you’re especially interested in binary data files: the chief distinction is that you open them in binary mode (use open mode flags with a “b”, such as “rb”, “wb”, and so on). See also the standard struct module, which can parse binary data loaded from a file.

[3] Unlike C character arrays, you don’t need to allocate or manage storage arrays when using Python strings. Simply create string objects as needed, and let Python manage the underlying memory space. Python reclaims unused objects’ memory space automatically, using a reference-count garbage collection strategy. Each object keeps track of the number of names, data-structures, etc. that reference it; when the count reaches zero, Python frees the object’s space. This scheme means Python doesn’t have to stop and scan all of memory to find unused space to free (an additional garbage component also collects cyclic objects).

[4] More mathematically minded readers (and students in my classes) sometimes detect a small asymmetry here: the leftmost item is at offset 0, but the rightmost is at offset -1. Alas, there is no such thing as a distinct -0 value in Python.
