Chapter 9. Strings and Serialization

Before we get involved with higher level design patterns, let's take a deep dive into one of Python's most common objects: the string. We'll see that there is a lot more to the string than meets the eye, and also cover searching strings for patterns and serializing data for storage or transmission.

In particular, we'll visit:

  • The complexities of strings, bytes, and byte arrays
  • The ins and outs of string formatting
  • A few ways to serialize data
  • The mysterious regular expression

Strings

Strings are a basic primitive in Python; we've used them in nearly every example we've discussed so far. All they do is represent an immutable sequence of characters. However, though you may not have considered it before, "character" is a bit of an ambiguous word; can Python strings represent sequences of accented characters? Chinese characters? What about Greek, Cyrillic, or Farsi?

In Python 3, the answer is yes. Python strings are all represented in Unicode, a character definition standard that can represent virtually any character in any language on the planet (and some made-up languages and random characters as well). This is done seamlessly, for the most part. So, let's think of Python 3 strings as an immutable sequence of Unicode characters. So what can we do with this immutable sequence? We've touched on many of the ways strings can be manipulated in previous examples, but let's quickly cover it all in one place: a crash course in string theory!

String manipulation

As you know, strings can be created in Python by wrapping a sequence of characters in single or double quotes. Multiline strings can easily be created using three quote characters, and multiple hardcoded strings can be concatenated together by placing them side by side. Here are some examples:

a = "hello"
b = 'world'
c = '''a multiple
line string'''
d = """More
multiple"""
e = ("Three " "Strings "
        "Together")

That last string is automatically composed into a single string by the interpreter. It is also possible to concatenate strings using the + operator (as in "hello " + "world"). Of course, strings don't have to be hardcoded. They can also come from various outside sources such as text files, user input, or encoded on the network.

Note

The automatic concatenation of adjacent strings can make for some hilarious bugs when a comma is missed. It is, however, extremely useful when a long string needs to be placed inside a function call without exceeding the 79 character line-length limit suggested by the Python style guide.

Like other sequences, strings can be iterated over (character by character), indexed, sliced, or concatenated. The syntax is the same as for lists.
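As a quick illustration of that sequence behavior, here is a minimal sketch; the variable names are ours and chosen only for the example:

```python
s = "hello world"

first = s[0]        # indexing starts at zero
last = s[-1]        # negative indexes count from the end
word = s[0:5]       # slicing returns a new string
shout = word + "!"  # concatenation also returns a new string

# Iterating over a string yields one-character strings
vowels = [ch for ch in s if ch in "aeiou"]

print(first, last, word, shout, vowels)
```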

The str class has numerous methods on it to make manipulating strings easier. The dir and help commands in the Python interpreter can tell us how to use all of them; we'll consider some of the more common ones directly.

Several Boolean convenience methods help us identify whether or not the characters in a string match a certain pattern. Here is a summary of these methods. Most of these, such as isalpha, isupper/islower, and startswith/endswith have obvious interpretations. The isspace method is also fairly obvious, but remember that all whitespace characters (including tab, newline) are considered, not just the space character.

The istitle method returns True if the first character of each word is capitalized and all other characters are lowercase. Note that it does not strictly enforce the English grammatical definition of title formatting. For example, Leigh Hunt's poem "The Glove and the Lions" is a valid title even though not all words are capitalized, yet istitle returns False for it. Robert Service's "The Cremation of Sam McGee" is also a valid title, but it fails the test as well, partly because there is an uppercase letter in the middle of the last word.
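A short sketch makes the behavior of istitle concrete. The first string is our own variation with every word capitalized, added only for contrast:

```python
titles = [
    "The Glove And The Lions",     # every word capitalized
    "The Glove and the Lions",     # 'and' and 'the' are lowercase
    "The Cremation of Sam McGee",  # lowercase 'of', uppercase G inside 'McGee'
]

# Map each title to the result of istitle()
results = {title: title.istitle() for title in titles}

for title, ok in results.items():
    print("{!r}: {}".format(title, ok))
```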

Be careful with the isdigit, isdecimal, and isnumeric methods, as they are more nuanced than you would expect. Many Unicode characters are considered numbers besides the ten digits we are used to. Worse, the period character that we use to construct floats from strings is not considered a decimal character, so '45.2'.isdecimal() returns False. For the same reason, "127.0.0.1" returns False for all three methods; they only test each character individually and never verify that the string as a whole is a valid number. Going the other way, the Arabic-Indic digit zero, Unicode value 0660, is a decimal character, so '45٠2' (or '45\u06602') passes all three tests even though it looks as if it contains a separator. Passing that string into the float() or int() constructor treats the character as the digit zero:

>>> float('45\u06602')
4502.0
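To see how the three methods differ, we can run them over a few sample strings. This is only a sketch; the sample characters are our own choice, written as escapes so they survive any editor:

```python
samples = {
    "plain digits": "452",
    "with period": "45.2",
    "Arabic-Indic zero": "45\u06602",
    "superscript two": "\u00b2",
    "fraction one half": "\u00bd",
}

# For each sample, record (isdecimal, isdigit, isnumeric)
checks = {
    name: (text.isdecimal(), text.isdigit(), text.isnumeric())
    for name, text in samples.items()
}

for name, flags in checks.items():
    print("{:20s}{}".format(name, flags))
```

The three methods form a progression: every decimal character is a digit, and every digit is numeric, but not the other way around.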

Other methods useful for pattern matching do not return Booleans. The count method tells us how many times a given substring shows up in the string, while find, index, rfind, and rindex tell us the position of a given substring within the original string. The two 'r' (for 'right' or 'reverse') methods start searching from the end of the string. The find methods return -1 if the substring can't be found, while index raises a ValueError in this situation. Have a look at some of these methods in action:

>>> s = "hello world"
>>> s.count('l')
3
>>> s.find('l')
2
>>> s.rindex('m')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

Most of the remaining string methods return transformations of the string. The upper, lower, capitalize, and title methods create new strings with all alphabetic characters in the given format. The translate method can use a dictionary to map arbitrary input characters to specified output characters.

For all of these methods, note that the input string remains unmodified; a brand new str instance is returned instead. If we need to manipulate the resultant string, we should assign it to a new variable, as in new_value = value.capitalize(). Often, once we've performed the transformation, we don't need the old value anymore, so a common idiom is to assign it to the same variable, as in value = value.title().
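Here is a small sketch of these transformations. The substitution table is made up for the example; str.maketrans is one convenient way to build the dictionary that translate expects:

```python
value = "hello WORLD"

upper = value.upper()
lower = value.lower()
capitalized = value.capitalize()  # first character upper, the rest lower
titled = value.title()            # first character of each word upper

# translate maps characters through a table; maketrans builds it from a dict
table = str.maketrans({"l": "1", "o": "0"})
leet = lower.translate(table)

# The original string is never modified
print(value, upper, lower, capitalized, titled, leet, sep="\n")
```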

Finally, a couple of string methods return or operate on lists. The split method accepts a substring and splits the string into a list of strings wherever that substring occurs. You can pass a number as a second parameter to limit the number of splits performed, so the result contains at most one more string than that number. The rsplit method behaves identically to split if you don't supply a limit, but if you do, it starts splitting from the end of the string. The partition and rpartition methods split the string at only the first or last occurrence of the substring, and return a tuple of three values: characters before the substring, the substring itself, and the characters after the substring.

As the inverse of split, the join method accepts a list of strings, and returns all of those strings combined together by placing the original string between them. The replace method accepts two arguments, and returns a string where each instance of the first argument has been replaced with the second. Here are some of these methods in action:

>>> s = "hello world, how are you"
>>> s2 = s.split(' ')
>>> s2
['hello', 'world,', 'how', 'are', 'you']
>>> '#'.join(s2)
'hello#world,#how#are#you'
>>> s.replace(' ', '**')
'hello**world,**how**are**you'
>>> s.partition(' ')
('hello', ' ', 'world, how are you')
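The limit argument and the right-hand variants are easier to see side by side. A brief sketch, using a made-up path string:

```python
path = "usr/local/lib/python"

left_limited = path.split("/", 1)    # one split, counted from the left
right_limited = path.rsplit("/", 1)  # one split, counted from the right

# rpartition splits at the last occurrence and always returns three values
head, sep, tail = path.rpartition("/")

# When the separator is absent, partition returns the string and two empties
missing = path.partition("?")

print(left_limited, right_limited, (head, sep, tail), missing, sep="\n")
```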

There you have it, a whirlwind tour of the most common methods on the str class! Now, let's look at Python 3's method for composing strings and variables to create new strings.

String formatting

Python 3 has a powerful string formatting and templating mechanism that allows us to construct strings comprised of hardcoded text and interspersed variables. We've used it in many previous examples, but it is much more versatile than the simple formatting specifiers we've used.

Any string can be turned into a format string by calling the format() method on it. This method returns a new string where specific characters in the input string have been replaced with values provided as arguments and keyword arguments passed into the function. The format method does not require a fixed set of arguments; internally, it uses the *args and **kwargs syntax that we discussed in Chapter 7, Python Object-oriented Shortcuts.

The special characters that are replaced in formatted strings are the opening and closing brace characters: { and }. We can insert pairs of these in a string and they will be replaced, in order, by any positional arguments passed to the str.format method:

template = "Hello {}, you are currently {}."
print(template.format('Dusty', 'writing'))

If we run these statements, it replaces the braces with variables, in order:

Hello Dusty, you are currently writing.

This basic syntax is not terribly useful if we want to reuse variables within one string or decide to use them in a different position. We can place zero-indexed integers inside the curly braces to tell the formatter which positional variable gets inserted at a given position in the string. Let's repeat the name:

template = "Hello {0}, you are {1}. Your name is {0}."
print(template.format('Dusty', 'writing'))

If we use these integer indexes, we have to use them in all the variables. We can't mix empty braces with positional indexes. For example, this code fails with an appropriate ValueError exception:

template = "Hello {}, you are {}. Your name is {0}."
print(template.format('Dusty', 'writing'))

Escaping braces

Brace characters are often useful in strings, aside from formatting. We need a way to escape them in situations where we want them to be displayed as themselves, rather than being replaced. This can be done by doubling the braces. For example, we can use Python to format a basic Java program:

template = """
public class {0} {{
    public static void main(String[] args) {{
        System.out.println("{1}");
    }}
}}"""

print(template.format("MyClass", "print('hello world')"))

Wherever we see the {{ or }} sequence in the template, that is, the braces enclosing the Java class and method definition, we know the format method will replace them with single braces, rather than some argument passed into the format method. Here's the output:

public class MyClass {
    public static void main(String[] args) {
        System.out.println("print('hello world')");
    }
}

The class name and contents of the output have been replaced with two parameters, while the double braces have been replaced with single braces, giving us a valid Java file. Turns out, this is about the simplest possible Python program to print the simplest possible Java program that can print the simplest possible Python program!

Keyword arguments

If we're formatting complex strings, it can become tedious to remember the order of the arguments or to update the template if we choose to insert a new argument. The format method therefore allows us to specify names inside the braces instead of numbers. The named variables are then passed to the format method as keyword arguments:

template = """
From: <{from_email}>
To: <{to_email}>
Subject: {subject}

{message}"""
print(template.format(
    from_email = "[email protected]",
    to_email = "[email protected]",
    message = "Here's some mail for you. "
    " Hope you enjoy the message!",
    subject = "You have mail!"
    ))

We can also mix index and keyword arguments (as with all Python function calls, the keyword arguments must follow the positional ones). We can even mix unlabeled positional braces with keyword arguments:

print("{} {label} {}".format("x", "y", label="z"))

As expected, this code outputs:

x z y

Container lookups

We aren't restricted to passing simple string variables into the format method. Any primitive, such as integers or floats can be printed. More interestingly, complex objects, including lists, tuples, dictionaries, and arbitrary objects can be used, and we can access indexes and variables (but not methods) on those objects from within the format string.

For example, if our e-mail message had grouped the from and to e-mail addresses into a tuple, and placed the subject and message in a dictionary, for some reason (perhaps because that's the input required for an existing send_mail function we want to use), we can format it like this:

emails = ("[email protected]", "[email protected]")
message = {
        'subject': "You Have Mail!",
        'message': "Here's some mail for you!"
        }
template = """
From: <{0[0]}>
To: <{0[1]}>
Subject: {message[subject]}
{message[message]}"""
print(template.format(emails, message=message))

The variables inside the braces in the template string look a little weird, so let's look at what they're doing. We have passed one argument as a position-based parameter and one as a keyword argument. The two e-mail addresses are looked up by 0[x], where x is either 0 or 1. The initial zero represents, as with other position-based arguments, the first positional argument passed to format (the emails tuple, in this case).

The square brackets with a number inside are the same kind of index lookup we see in regular Python code, so 0[0] maps to emails[0], in the emails tuple. The indexing syntax works with any indexable object, so we see similar behavior when we access message[subject], except this time we are looking up a string key in a dictionary. Notice that unlike in Python code, we do not need to put quotes around the string in the dictionary lookup.

We can even do multiple levels of lookup if we have nested data structures. I would recommend against doing this often, as template strings rapidly become difficult to understand. If we have a dictionary that contains a tuple, we can do this:

emails = ("[email protected]", "[email protected]")
message = {
        'emails': emails,
        'subject': "You Have Mail!",
        'message': "Here's some mail for you!"
        }
template = """
From: <{0[emails][0]}>
To: <{0[emails][1]}>
Subject: {0[subject]}
{0[message]}"""
print(template.format(message))

Object lookups

Indexing makes format lookup powerful, but we're not done yet! We can also pass arbitrary objects as parameters, and use the dot notation to look up attributes on those objects. Let's change our e-mail message data once again, this time to a class:

class EMail:
    def __init__(self, from_addr, to_addr, subject, message):
        self.from_addr = from_addr
        self.to_addr = to_addr
        self.subject = subject
        self.message = message

email = EMail("[email protected]", "[email protected]",
        "You Have Mail!",
         "Here's some mail for you!")

template = """
From: <{0.from_addr}>
To: <{0.to_addr}>
Subject: {0.subject}

{0.message}"""
print(template.format(email))

The template in this example may be more readable than the previous examples, but the overhead of creating an e-mail class adds complexity to the Python code. It would be foolish to create a class for the express purpose of including the object in a template. Typically, we'd use this sort of lookup if the object we are trying to format already exists. This is true of all the examples; if we have a tuple, list, or dictionary, we'll pass it into the template directly. Otherwise, we'd just create a simple set of positional and keyword arguments.

Making it look right

It's nice to be able to include variables in template strings, but sometimes the variables need a bit of coercion to make them look right in the output. For example, if we are doing calculations with currency, we may end up with a long decimal that we don't want to show up in our template:

subtotal = 12.32
tax = subtotal * 0.07
total = subtotal + tax

print("Sub: ${0} Tax: ${1} Total: ${total}".format(
    subtotal, tax, total=total))

If we run this formatting code, the output doesn't quite look like proper currency:

Sub: $12.32 Tax: $0.8624 Total: $13.182400000000001

Note

Technically, we should never use floating-point numbers in currency calculations like this; we should construct decimal.Decimal() objects instead. Floats are dangerous because their calculations are inherently inaccurate beyond a specific level of precision. But we're looking at strings, not floats, and currency is a great example for formatting!
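For comparison, here is a rough sketch of the same calculation using decimal.Decimal. Rounding the tax to whole cents with ROUND_HALF_UP is our assumption; real applications should follow whatever rounding rule applies to them:

```python
from decimal import Decimal, ROUND_HALF_UP

# Build Decimals from strings so no float inaccuracy sneaks in
subtotal = Decimal("12.32")

# Round the tax to two places; the rounding mode is an assumption
tax = (subtotal * Decimal("0.07")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)
total = subtotal + tax

print(subtotal, tax, total)
```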

To fix the preceding format string, we can include some additional information inside the curly braces to adjust the formatting of the parameters. There are tons of things we can customize, but the basic syntax inside the braces is the same; first, we use whichever of the earlier layouts (positional, keyword, index, attribute access) is suitable to specify the variable that we want to place in the template string. We follow this with a colon, and then the specific syntax for the formatting. Here's an improved version:

print("Sub: ${0:0.2f} Tax: ${1:0.2f} "
        "Total: ${total:0.2f}".format(
            subtotal, tax, total=total))

The 0.2f format specifier after the colons says, from left to right: pad with zeros (the leading 0 has no visible effect here because we haven't given a minimum width; a zero is displayed to the left of the decimal point for values lower than one in any case); show two places after the decimal; format the input value as a float.

We can also specify that each number should take up a particular number of characters on the screen by placing a value before the period in the precision. This can be useful for outputting tabular data, for example:

orders = [('burger', 2, 5),
        ('fries', 3.5, 1),
        ('cola', 1.75, 3)]

print("PRODUCT    QUANTITY    PRICE    SUBTOTAL")
for product, price, quantity in orders:
    subtotal = price * quantity
    print("{0:10s}{1: ^9d}    ${2: <8.2f}${3: >7.2f}".format(
        product, quantity, price, subtotal))

Ok, that's a pretty scary looking format string, so let's see how it works before we break it down into understandable parts:

PRODUCT    QUANTITY    PRICE    SUBTOTAL
burger        5        $2.00    $  10.00
fries         1        $3.50    $   3.50
cola          3        $1.75    $   5.25

Nifty! So, how is this actually happening? We have four variables we are formatting, in each line in the for loop. The first variable is a string and is formatted with {0:10s}. The s means it is a string variable, and the 10 means it should take up ten characters. By default, with strings, if the string is shorter than the specified number of characters, it appends spaces to the right side of the string to make it long enough (beware, however: if the original string is too long, it won't be truncated!). We can change this behavior (to fill with other characters or change the alignment in the format string), as we do for the next value, quantity.

The formatter for the quantity value is {1: ^9d}. The d represents an integer value. The 9 tells us the value should take up nine characters. By default, numbers are padded with spaces and aligned to the right of that width, which is not the layout we want for this column. So we explicitly specify a space (immediately after the colon) as the padding character, followed by the caret character ^, which tells us that the number should be aligned in the center of the available padding; this makes the column look a bit more professional. The specifiers have to be in the right order, although all are optional: fill first, then align, then the size, and finally, the type.

We do similar things with the specifiers for price and subtotal. For price, we use {2: <8.2f} and for subtotal, {3: >7.2f}. In both cases, we're specifying a space as the fill character, but we use the < and > symbols, respectively, to represent that the numbers should be aligned to the left or right within the minimum space of eight or seven characters. Further, each float should be formatted to two decimal places.
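The fill, alignment, and width options are easier to compare in isolation. The following sketch uses arbitrary values and fill characters of our own choosing:

```python
default_int = "{0:5d}".format(42)      # numbers are right-aligned by default
default_str = "{0:5s}|".format("ab")   # strings are left-aligned by default
zero_padded = "{0:05d}".format(42)     # a leading 0 asks for zero padding
centered = "{0:*^6d}".format(42)       # '*' is the fill, '^' centers
left = "{0:<5d}|".format(42)           # '<' aligns left, fill defaults to space

for result in (default_int, default_str, zero_padded, centered, left):
    print(repr(result))
```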

The "type" character for different types can affect formatting output as well. We've seen the s, d, and f types, for strings, integers, and floats. Most of the other format specifiers are alternative versions of these; for example, o represents octal format and X represents hexadecimal for integers. The n type specifier can be useful for formatting integer separators in the current locale's format. For floating-point numbers, the % type will multiply by 100 and format a float as a percentage.
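A few of these type characters in action, as a small sketch with arbitrary sample values:

```python
octal = "{0:o}".format(255)         # integer in base 8
hexed = "{0:X}".format(255)         # integer in uppercase hexadecimal
percent = "{0:.1%}".format(0.256)   # multiplied by 100, one decimal place

print(octal, hexed, percent)
```

We left the n type out of the sketch because its output depends on the locale settings of the machine running it.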

While these standard formatters apply to most built-in objects, it is also possible for other objects to define nonstandard specifiers. For example, if we pass a datetime object into format, we can use the specifiers used in the datetime.strftime function, as follows:

import datetime
print("{0:%Y-%m-%d %I:%M%p }".format(
    datetime.datetime.now()))

It is even possible to write custom formatters for objects we create ourselves, but that is beyond the scope of this module. Look into overriding the __format__ special method if you need to do this in your code. The most comprehensive instructions can be found in PEP 3101 at http://www.python.org/dev/peps/pep-3101/, although the details are a bit dry. You can find more digestible tutorials using a web search.
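To give a flavor of what overriding __format__ looks like, here is a minimal sketch. The Temperature class and its convention of a trailing F meaning Fahrenheit are entirely hypothetical, invented for this illustration:

```python
class Temperature:
    """Hypothetical class used only to illustrate __format__."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # Made-up convention: a trailing 'F' converts to Fahrenheit and
        # the rest of the spec is handed on as a normal float format
        if spec.endswith("F"):
            fahrenheit = self.celsius * 9 / 5 + 32
            return format(fahrenheit, spec[:-1] + "f") + "F"
        # Otherwise format the Celsius value, defaulting to one decimal
        return format(self.celsius, spec or ".1f") + "C"


t = Temperature(21.5)
in_c = "{0}".format(t)
in_f = "{0:.1F}".format(t)
print(in_c, in_f)
```

Whatever follows the colon in the braces arrives in __format__ as the spec string, so each class is free to interpret it however it likes.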

The Python formatting syntax is quite flexible but it is a difficult mini-language to remember. I use it every day and still occasionally have to look up forgotten concepts in the documentation. It also isn't powerful enough for serious templating needs, such as generating web pages. There are several third-party templating libraries you can look into if you need to do more than basic formatting of a few strings.

Strings are Unicode

At the beginning of this section, we defined strings as collections of immutable Unicode characters. This actually makes things very complicated at times, because Unicode isn't really a storage format. If you get a string of bytes from a file or a socket, for example, they won't be in Unicode. They will, in fact, be the built-in type bytes. Bytes are immutable sequences of... well, bytes. Bytes are the lowest-level storage format in computing. They represent 8 bits, usually described as an integer between 0 and 255, or a hexadecimal equivalent between 0 and FF. Bytes don't represent anything specific; a sequence of bytes may store characters of an encoded string, or pixels in an image.

If we print a bytes object, any bytes that map to ASCII representations will be printed as their original character, while non-ASCII bytes (whether they are binary data or other characters) are printed as hex codes escaped by the \x escape sequence. You may find it odd that a byte, represented as an integer, can map to an ASCII character. But ASCII is really just a code where each letter is represented by a different byte pattern, and therefore, a different integer. The character "a" is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. Specifically, all of these are an interpretation of the binary pattern 01100001.
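We can check these equivalences directly in Python. A minimal sketch, with variable names of our own:

```python
data = b"a"

as_int = data[0]                 # indexing a bytes object gives an integer
as_hex = hex(as_int)             # the same value in hexadecimal
as_bits = format(as_int, "08b")  # and as an eight-digit binary pattern

# Bytes outside the ASCII range are shown as \x escapes when printed
mixed = bytes([97, 98, 233])
shown = repr(mixed)

print(as_int, as_hex, as_bits, shown)
```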

Many I/O operations only know how to deal with bytes, even if the bytes object refers to textual data. It is therefore vital to know how to convert between bytes and Unicode.

The problem is that there are many ways to map bytes to Unicode text. Bytes are machine-readable values, while text is a human-readable format. Sitting in between is an encoding that maps a given sequence of bytes to a given sequence of text characters.

However, there are multiple such encodings (ASCII is only one of them). The same sequence of bytes represents completely different text characters when mapped using different encodings! So, bytes must be decoded using the same character set with which they were encoded. It's not possible to get text from bytes without knowing how the bytes should be decoded. If we receive unknown bytes without a specified encoding, the best we can do is guess what format they are encoded in, and we may be wrong.

Converting bytes to text

If we have an array of bytes from somewhere, we can convert it to Unicode using the .decode method on the bytes class. This method accepts a string for the name of the character encoding. There are many such names; common ones for Western languages include ASCII, UTF-8, and latin-1.

The sequence of bytes (in hex), 63 6c 69 63 68 e9, actually represents the characters of the word cliché in the latin-1 encoding. The following example defines this sequence of bytes and decodes it into a Unicode string using the latin-1 encoding:

characters = b'\x63\x6c\x69\x63\x68\xe9'
print(characters)
print(characters.decode("latin-1"))

The first line creates a bytes object; the b character immediately before the string tells us that we are defining a bytes object instead of a normal Unicode string. Within the string, each byte is specified using, in this case, a hexadecimal number. Each \x escape within the bytes literal says, "the next two characters represent a byte using hexadecimal digits."

Provided we are using a shell that understands the latin-1 encoding, the two print calls will output the following strings:

b'clich\xe9'
cliché

The first print statement renders the bytes for ASCII characters as themselves. The unknown (unknown to ASCII, that is) character stays in its escaped hex format. The output includes a b character at the beginning of the line to remind us that it is a bytes representation, not a string.

The next call decodes the bytes using the latin-1 encoding. The decode method returns a normal (Unicode) string with the correct characters. However, if we had decoded these same bytes using the Cyrillic "iso8859-5" encoding, we'd have ended up with the string 'clichщ'! This is because the \xe9 byte maps to different characters in the two encodings.
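The effect of choosing a different encoding is easy to reproduce. This sketch decodes the same bytes three ways; the UTF-8 attempt is our own addition, to show that some byte sequences are simply invalid in a given encoding:

```python
raw = b"\x63\x6c\x69\x63\x68\xe9"

as_latin1 = raw.decode("latin-1")      # the intended text
as_cyrillic = raw.decode("iso8859-5")  # same bytes, different last letter

# A lone 0xe9 byte is not valid UTF-8, so decoding raises an exception
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, as_cyrillic, utf8_ok)
```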

Converting text to bytes

If we need to convert incoming bytes into Unicode, clearly we're also going to have situations where we convert outgoing Unicode into byte sequences. This is done with the encode method on the str class, which, like the decode method, requires a character set. The following code creates a Unicode string and encodes it in different character sets:

characters = "cliché"
print(characters.encode("UTF-8"))
print(characters.encode("latin-1"))
print(characters.encode("CP437"))
print(characters.encode("ascii"))

The first three encodings create a different set of bytes for the accented character. The fourth one can't even handle that character:

b'clich\xc3\xa9'
b'clich\xe9'
b'clich\x82'
Traceback (most recent call last):
  File "1261_10_16_decode_unicode.py", line 5, in <module>
    print(characters.encode("ascii"))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 5: ordinal not in range(128)

Do you understand the importance of encoding now? The accented character is represented by different bytes in each encoding; if we use the wrong one when we are decoding bytes to text, we get the wrong character.

The exception in the last case is not always the desired behavior; there may be cases where we want the unknown characters to be handled in a different way. The encode method takes an optional string argument named errors that can define how such characters should be handled. This string can be one of the following:

  • strict
  • replace
  • ignore
  • xmlcharrefreplace

The strict strategy is the default we just saw. When a character is encountered that does not have a valid representation in the requested encoding, an exception is raised. When the replace strategy is used, the character is replaced with a different character; in ASCII, it is a question mark; other encodings may use different symbols, such as an empty box. The ignore strategy simply discards any characters it can't encode, while the xmlcharrefreplace strategy creates an XML entity representing the Unicode character. This can be useful when converting unknown strings for use in an XML document. Here's how each of the strategies affects our sample word:

Strategy             "cliché".encode("ascii", strategy)
replace              b'clich?'
ignore               b'clich'
xmlcharrefreplace    b'clich&#233;'
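The same comparison can be generated in code, which is handy for trying other words or encodings. A minimal sketch:

```python
word = "clich\u00e9"  # the word cliché, with the accent written as an escape

# Encode the word to ASCII once per non-default strategy
encoded = {
    strategy: word.encode("ascii", errors=strategy)
    for strategy in ("replace", "ignore", "xmlcharrefreplace")
}

# The default, strict, raises instead of returning bytes
try:
    word.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

for strategy, result in encoded.items():
    print(strategy, result)
```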

It is possible to call the str.encode and bytes.decode methods without passing an encoding string. In Python 3, the encoding then defaults to UTF-8, which is the value reported by the sys.getdefaultencoding() function. Other operations, such as opening a text file, may instead fall back to a default that depends on the current operating system and locale or regional settings. It is usually a good idea to specify the encoding explicitly, though, since a platform default may change, or the program may one day be extended to work on text from a wider variety of sources.

If you are encoding text and don't know which encoding to use, it is best to use the UTF-8 encoding. UTF-8 is able to represent any Unicode character. In modern software, it is a de facto standard encoding to ensure documents in any language—or even multiple languages—can be exchanged. The various other possible encodings are useful for legacy documents or in regions that still use different character sets by default.

The UTF-8 encoding uses one byte to represent each ASCII character, and between two and four bytes for every other character. UTF-8 is special because it is backwards-compatible with ASCII; any ASCII document encoded using UTF-8 will be identical to the original ASCII document.
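We can measure this variable width ourselves. The sample characters below are our own picks, written as escapes: a plain letter, an accented letter, the euro sign, and an emoji:

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Pure ASCII text produces identical bytes in both encodings
ascii_text = "plain ASCII text"
same = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(lengths, same)
```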

Note

I can never remember whether to use encode or decode to convert from binary bytes to Unicode. I always wished these methods were named "to_binary" and "from_binary" instead. If you have the same problem, try mentally replacing the word "code" with "binary"; "enbinary" and "debinary" are pretty close to "to_binary" and "from_binary". I have saved a lot of time by not looking up the method help files since devising this mnemonic.

Mutable byte strings

The bytes type, like str, is immutable. We can use index and slice notation on a bytes object and search for a particular sequence of bytes, but we can't extend or modify them. This can be very inconvenient when dealing with I/O, as it is often necessary to buffer incoming or outgoing bytes until they are ready to be sent. For example, if we are receiving data from a socket, it may take several recv calls before we have received an entire message.

This is where the bytearray built-in comes in. This type behaves something like a list, except it only holds bytes. The constructor for the class can accept a bytes object to initialize it. The extend method can be used to append another bytes object to the existing array (for example, when more data comes from a socket or other I/O channel).
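The buffering idea might be sketched as follows. There is no real socket here; the chunks list stands in for the results of successive recv calls, and treating a newline as the end of a message is purely our assumption:

```python
# Stand-in for data arriving in pieces from a socket (hypothetical)
chunks = [b"HEL", b"LO ", b"WOR", b"LD\n"]

buffer = bytearray()
for chunk in chunks:
    buffer.extend(chunk)          # grow the buffer in place
    if buffer.endswith(b"\n"):    # assumed end-of-message marker
        break

# Convert to immutable bytes, then decode to text
message = bytes(buffer).decode("ascii")
print(repr(message))
```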

Slice notation can be used on bytearray to modify the item inline. For example, this code constructs a bytearray from a bytes object and then replaces two bytes:

b = bytearray(b"abcdefgh")
b[4:6] = b"\x15\xa3"
print(b)

The output looks like this:

bytearray(b'abcd\x15\xa3gh')

Be careful; if we want to manipulate a single element in the bytearray, it will expect us to pass an integer between 0 and 255 inclusive as the value. This integer represents a specific byte pattern. If we try to pass a character or bytes object, it will raise an exception.
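A short sketch shows which assignments are rejected. The out-of-range integer is our own extra case, included to show that it fails differently from the wrong types:

```python
b = bytearray(b"abc")
b[0] = 65  # an integer from 0 to 255 is accepted

problems = []
for bad_value in (b"Z", "Z", 256):
    try:
        b[1] = bad_value
    except (TypeError, ValueError) as exc:
        # Record which exception type each bad value triggered
        problems.append(type(exc).__name__)

print(b, problems)
```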

A single byte character can be converted to an integer using the ord (short for ordinal) function. This function returns the integer representation of a single character:

b = bytearray(b'abcdef')
b[3] = ord(b'g')
b[4] = 68
print(b)

The output looks like this:

bytearray(b'abcgDf')

After constructing the array, we replace the character at index 3 (the fourth character, as indexing starts at 0, as with lists) with byte 103. This integer was returned by the ord function and is the ASCII character for the lowercase g. For illustration, we also replaced the next character up with the byte number 68, which maps to the ASCII character for the uppercase D.

The bytearray type has methods that allow it to behave like a list (we can append integer bytes to it, for example), but also like a bytes object; we can use methods like count and find the same way they would behave on a bytes or str object. The difference is that bytearray is a mutable type, which can be useful for building up complex sequences of bytes from a specific input source.
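Both sides of that dual personality fit in a few lines. A minimal sketch with arbitrary data:

```python
b = bytearray(b"hello")

b.append(ord(b"!"))      # list-like: append a single integer byte

how_many = b.count(b"l") # bytes-like: count occurrences of a byte sequence
where = b.find(b"lo")    # bytes-like: position of a subsequence

print(b, how_many, where)
```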
