Converting between bytes and str

To convert between bytes and str we must know the encoding used to represent the string's Unicode code points as a sequence of bytes. Python supports a wide variety of so-called codecs, such as UTF-8, UTF-16, ASCII, Latin-1, Windows-1251, and so on; consult the Python documentation for a current list of codecs.
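To make the role of the codec concrete, here is a small sketch that encodes one letter with several of the codecs named above. The variable names are ours, chosen only for illustration.

```python
letter = "å"  # Unicode code point U+00E5

# The same code point becomes a different byte sequence under each codec.
as_utf8 = letter.encode("utf-8")       # two bytes
as_latin1 = letter.encode("latin-1")   # one byte
as_utf16 = letter.encode("utf-16-le")  # two bytes, little-endian, no byte order mark

# ASCII has no representation for this letter, so encoding fails.
try:
    letter.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Because the bytes differ from codec to codec, a bytes object on its own does not tell us which text it represents.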

In Python we can encode a Unicode str into a bytes object, and going the other way we can decode a bytes object into a Unicode str. In either direction it's up to us to specify the encoding. Python won't, and generally speaking can't, do anything to prevent you from erroneously decoding UTF-16 data stored in a bytes object using, say, the CP037 codec, which handles strings on legacy IBM mainframes.

If you're lucky the decoding will fail with a UnicodeError at runtime; if you're unlucky you'll wind up with a str full of garbage that will go undetected by your program.
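Both outcomes can be sketched in a few lines. Here we assume a short sample string of our own and decode its UTF-16 bytes with two wrong codecs: CP037, which accepts the bytes silently, and UTF-8, which rejects them.

```python
text = "på vei"
data = text.encode("utf-16")  # starts with a byte order mark

# Unlucky case: CP037 assigns a character to each byte, so decoding
# "succeeds" and silently produces a different string.
garbage = data.decode("cp037")

# Lucky case: the byte order mark is not valid UTF-8, so decoding fails.
# UnicodeDecodeError is a subclass of UnicodeError.
try:
    data.decode("utf-8")
    failed = False
except UnicodeError:
    failed = True
```

Only the second mistake announces itself; the first leaves the program running with the wrong text.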

Figure 2.2: Encoding and Decoding.

Let's start an interactive session with an interesting Unicode string, a pangram that contains all the characters of the 29-letter Norwegian alphabet:

>>> norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

We'll now encode that into a bytes object with the UTF-8 codec, using the encode() method of the str object:

>>> data = norsk.encode('utf-8')
>>> data
b'Jeg begynte \xc3\xa5 fort\xc3\xa6re en sandwich mens jeg kj\xc3\xb8rte taxi p\xc3\xa5 vei til quiz'

See how each of the non-ASCII Norwegian letters (å, æ and ø) has been rendered as a pair of bytes, while the ASCII characters are unchanged.
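One way to check this observation is to compare lengths. The sketch below repeats the pangram so that it runs on its own; the names extra_bytes and non_ascii are ours.

```python
norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"
data = norsk.encode("utf-8")

# len() counts code points for a str and bytes for a bytes object.
extra_bytes = len(data) - len(norsk)

# Collect the characters that fall outside the ASCII range.
non_ascii = [ch for ch in norsk if ord(ch) > 127]

# Each of those characters takes two bytes in UTF-8.
sizes = [len(ch.encode("utf-8")) for ch in non_ascii]
```

The bytes object is longer than the str by exactly one byte for each non-ASCII letter.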

We can reverse the process using the decode() method of the bytes object. Again, it is up to us to supply the correct encoding:

>>> norwegian = data.decode('utf-8')

We can check that the encoding/decoding round-trip gives us a result equal to what we started with:

>>> norwegian == norsk
True

Let's try to display it for good measure:

>>> norwegian
'Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz'

All this messing about with encodings may seem like unnecessary detail at this juncture – especially if you operate in an anglophone environment – but it's crucial to understand since files and network resources such as HTTP responses are transmitted as byte streams, whereas we prefer to work with the convenience of Unicode strings.
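As a minimal sketch of the file case, assuming a temporary file whose name we have made up for illustration, we can write the encoded pangram to disk and read it back both as raw bytes and as decoded text:

```python
import os
import tempfile

norsk = "Jeg begynte å fortære en sandwich mens jeg kjørte taxi på vei til quiz"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pangram.txt")  # hypothetical file name

    # On disk the text exists only as bytes.
    with open(path, "wb") as f:
        f.write(norsk.encode("utf-8"))

    # Binary mode hands the bytes back untouched.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode decodes for us, but we still have to name the encoding.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
```

If we leave out the encoding argument in text mode, Python falls back on a default that can vary between environments, which is why we state it explicitly here.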

String differences between Python 3 and Python 2
The biggest difference between contemporary Python 3 and legacy Python 2 is the handling of strings. In Python 2 and earlier, the str type was a so-called byte string, where each character was encoded as a single byte. In this sense, Python 2 str was similar to Python 3 bytes; however, the interfaces presented by str and bytes are in fact different in significant ways. In particular, their constructors are completely different, and indexing into a bytes object returns an integer rather than a single-code-point string. To confuse matters further, there is also a bytes type in Python 2.6 and Python 2.7, but this is just a synonym for str and as such has an identical interface. If you're writing text handling code intended to be portable across Python 2 and Python 3 (which is perfectly possible), tread carefully!
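The indexing and constructor differences can be seen in a short sketch. It is written for Python 3 only, and the sample values are our own.

```python
# Python 3 only: contrast the bytes and str interfaces.
raw = b"quiz"
text = "quiz"

first_byte = raw[0]        # indexing bytes gives an int
first_char = text[0]       # indexing str gives a one-character str
one_byte_slice = raw[0:1]  # slicing bytes still gives bytes

zeros = bytes(3)           # three zero-valued bytes
three = str(3)             # the text '3'
```

Code that indexes into byte data therefore behaves differently from code that indexes into text, even when both hold the same ASCII characters.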