The next major built-in type is the Python string—an ordered collection of characters, used to store and represent text-based information. From a functional perspective, strings can be used to represent just about anything that can be encoded as text: symbols and words (e.g., your name), contents of text files loaded into memory, Internet addresses, Python programs, and so on.
You may have used strings in other languages too; Python’s strings serve the same role as character arrays in languages such as C, but they are a somewhat higher-level tool than arrays. Unlike C, Python strings come with a powerful set of processing tools. Also unlike languages such as C, Python has no special type for single characters (like C’s char), only one-character strings.
Strictly speaking, Python strings are categorized as immutable sequences—meaning that they have a left-to-right positional order (sequence), and cannot be changed in place (immutable). In fact, strings are the first representative of the larger class of objects called sequences. Pay special attention to the operations introduced here, because they will work the same on other sequence types we’ll see later, such as lists and tuples.
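Both properties can be checked directly. The following is a minimal sketch, not taken from the text: the variable names are illustrative, and the code is written so that it also runs under Python 3.

```python
# Strings are sequences: offsets run left to right, starting at 0.
s = 'spam'
first = s[0]            # 's'
last = s[-1]            # 'm' (negative offsets count back from the end)

# Strings are immutable: in-place item assignment is rejected.
try:
    s[0] = 'x'
    changed_in_place = True
except TypeError:
    changed_in_place = False

# To "change" a string, build a new object and bind a name to it.
s2 = 'x' + s[1:]        # 'xpam'; the original s is untouched
print(s2)
```

The same indexing and slicing expressions apply to lists and tuples; only lists accept the in-place assignment.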
Table 5-1 introduces common string literals and operations. Empty strings are written as two quotes with nothing in between, and there are a variety of ways to code strings. For processing, strings support expression operations such as concatenation (combining strings), slicing (extracting sections), indexing (fetching by offset), and so on. Besides expressions, Python also provides a set of string methods that implement common string-specific tasks, as well as a string module that mirrors most string methods.
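Operation | Interpretation
s1 = '' | Empty string
s2 = "spam's" | Double quotes
block = """...""" | Triple-quoted blocks
s3 = r'\temp\spam' | Raw strings
s4 = u'spam' | Unicode Strings
s1 + s2, s2 * 3 | Concatenate, repeat
s2[i], s2[i:j], len(s2) | Index, slice, length
"a %s parrot" % 'dead' | String formatting
s2.find('pa'), s2.replace('pa', 'xx'), s1.split() | String method calls
for x in s2, 'm' in s2 | Iteration, membership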
Methods and modules are discussed later in this section. Beyond the core set of string tools, Python also supports more advanced pattern-based string processing with the standard library’s re (regular expression) module, introduced in Chapter 27. This section starts with an overview of string literal forms and basic string expressions, and then looks at more advanced tools such as string methods and formatting.
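As a quick preview of the operations named above, here is a hedged sketch that exercises each kind once. The sample values and variable names are assumptions chosen for illustration; the details of each tool are covered later in the section.

```python
s1 = ''                               # empty string
s2 = "spam's"                         # double quotes allow an embedded '

joined = s1 + s2                      # concatenation
repeated = 'Ni!' * 3                  # repetition
first = s2[0]                         # indexing: fetch by offset
middle = s2[1:3]                      # slicing: offsets 1 up to, not including, 3
length = len(s2)                      # number of items in the string
formatted = 'a %s parrot' % 'dead'    # string formatting expression
position = s2.find('pa')              # method call: offset of substring, or -1
replaced = s2.replace('pa', 'xx')     # method call: returns a new string
has_m = 'm' in s2                     # membership test
letters = [c for c in s2]             # iteration, one character at a time

print(formatted)
```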
By and large, strings are fairly easy to use in Python. Perhaps the most complicated thing about them is that there are so many ways to write them in your code:
Single quotes: 'spa"m'
Double quotes: "spa'm"
Triple quotes: '''... spam ...''', """... spam ..."""
Escape sequences: "s\tp\na\0m"
Raw strings: r"C:\new\test.spm"
Unicode strings: u'eggs\u0020spam'
The single- and double-quoted forms are by far the most common; the others serve specialized roles. Let’s take a quick look at each of these options.
Around Python strings, single and double quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes—the two forms work the same, and return the same type of object. For example, the following two strings are identical, once coded:
>>> 'shrubbery', "shrubbery"
('shrubbery', 'shrubbery')
The reason for including both is that it allows you to embed a quote character of the other variety inside a string, without escaping it with a backslash: you may embed a single quote character in a string enclosed in double quote characters, and vice-versa:
>>> 'knight"s', "knight's"
('knight"s', "knight's")
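The equivalence of the two quote styles can also be confirmed in a script. This is a small sketch with illustrative variable names, not code from the text:

```python
single = 'shrubbery'
double = "shrubbery"

same_value = (single == double)               # the quote style is not stored
same_type = type(single) is type(double)      # both are plain string objects

# Each style can hold the other quote character without a backslash.
a = "knight's"
b = 'knight"s'

print(a)
print(b)
```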
Incidentally, Python automatically concatenates adjacent string literals, although it is almost as simple to add a + operator between them to invoke concatenation explicitly:
>>> title = "Meaning " 'of' " Life"
>>> title
'Meaning of Life'
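A short sketch comparing the implicit and explicit forms follows. The names are illustrative, and the parenthesized multi-line form is one common use of the feature rather than something the text prescribes.

```python
implicit = "Meaning " 'of' " Life"          # adjacent literals are joined
explicit = "Meaning " + 'of' + " Life"      # same result with the + operator

# Implicit concatenation applies only to literals, not to names or
# expressions. Inside parentheses, the literals may span several lines.
long_text = ("Meaning "
             "of "
             "Life")

print(implicit)
```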
Notice in all of these outputs that Python prefers to print strings in single quotes, unless they embed one. You can also embed quotes by escaping them with backslashes:
>>> 'knight\'s', "knight\"s"
("knight's", 'knight"s')
But to understand why, we need to explain how escapes work in general.
The last example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings, known as escape sequences.
Escape sequences let us embed byte codes in strings that cannot be easily typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, here is a five-character string that embeds a newline and a tab:
>>> s = 'a\nb\tc'
The two characters \n stand for a single character: the byte containing the binary value of the newline character in your character set (usually, ASCII code 10). Similarly, the sequence \t is replaced with the tab character. The way this string looks when printed depends on how you print it. The interactive echo shows the special characters as escapes, but print interprets them instead:
>>> s
'a\nb\tc'
>>> print s
a
b       c
To be completely sure how many bytes are in this string, you can use the built-in len function; it returns the actual number of bytes in a string, regardless of how it is displayed.
>>> len(s)
5
This string is five bytes long: an ASCII “a” byte, a newline byte, an ASCII “b” byte, and so on; the original backslash characters are not really stored with the string in memory.
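The same point can be checked by looking at the character codes themselves. This is a minimal sketch with illustrative names; note that under Python 3, len counts characters rather than bytes, which gives the same result for this ASCII-only string.

```python
s = 'a\nb\tc'

length = len(s)                    # 5: each escape yields one character
codes = [ord(ch) for ch in s]      # numeric code of each stored character
echo_form = repr(s)                # the form the interactive echo displays

print(echo_form)                   # escapes shown as typed
print(s)                           # newline and tab interpreted
```

The codes list contains 10 and 9 for the newline and tab; no code for a backslash (92) appears, because the backslashes exist only in the source text of the literal.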
For coding such special bytes, Python recognizes a full set of escape code sequences, listed in Table 5-2. Some sequences allow you to embed absolute binary values into the bytes of a string. For instance, here’s another five-character string that embeds two binary zero bytes:
>>> s = 'a\0b\0c'
>>> s
'a\x00b\x00c'
>>> len(s)
5
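Absolute values can be written in octal or hexadecimal, and both spellings produce the same stored character. The sketch below uses illustrative names and sample values that are not from the text:

```python
s = 'a\0b\0c'

length = len(s)                        # 5: a zero byte does not end the string
same = (s == 'a\x00b\x00c')            # \0 (octal) and \x00 (hex) are equal
zero_offsets = [i for i, ch in enumerate(s) if ord(ch) == 0]

octal_a = '\101'                       # octal 101 is decimal 65, the letter A
hex_a = '\x41'                         # hex 41 is also decimal 65

print(repr(s))
```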
Escape | Meaning
\newline | Ignored (continuation)
\\ | Backslash (keeps a \)
\' | Single quote (keeps ')
\" | Double quote (keeps ")
\a | Bell
\b | Backspace
\f | Formfeed
\n | Newline (linefeed)
\r | Carriage return
\t | Horizontal tab
\v | Vertical tab
\N{id} | Unicode dbase id
\uhhhh | Unicode 16-bit hex
\Uhhhhhhhh | Unicode 32-bit hex[1]
\xhh | Hex digits value hh
\ooo | Octal digits value
|