Detecting and converting character encodings

A common occurrence with text processing is finding text that has nonstandard character encoding. Ideally, all text would be ASCII or utf-8, but that's just not the reality. In cases when you have non-ASCII or non-utf-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before doing further processing.

Getting ready

You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict with two keys: confidence and encoding. The confidence value is a number between 0 and 1 that indicates how sure charade is that the value for encoding is correct.

# -*- coding: utf-8 -*-
import charade

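def detect(s):
  # charade.detect() expects bytes, so a str is encoded first.
  try:
    if isinstance(s, str):
      return charade.detect(s.encode())
    else:
      return charade.detect(s)
  except UnicodeDecodeError:
    return charade.detect(s.encode('utf-8'))

def convert(s):
  # Work on bytes so the detected encoding can be used to decode them.
  if isinstance(s, str):
    s = s.encode()

  encoding = detect(s)['encoding']

  # Decode with the detected encoding and return a str.
  if encoding == 'utf-8':
    return s.decode()
  else:
    return s.decode(encoding)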

And here's some example code using detect() to determine character encoding:

>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect('abcdé')
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> encoding.detect(bytes('\222\222\223\225', 'latin-1'))
{'confidence': 0.5, 'encoding': 'windows-1252'}
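
The confidence value can be used to decide whether a guess is worth trusting. The following is a minimal sketch, not part of charade or encoding.py: the helper name pick_encoding(), the 0.6 cutoff, and the utf-8 default are all assumptions chosen for illustration. It only inspects a dict shaped like the one detect() returns:

```python
def pick_encoding(result, min_confidence=0.6, default='utf-8'):
    """Return the detected encoding, or `default` if the guess is weak.

    `result` is a dict with 'encoding' and 'confidence' keys, shaped like
    the one detect() returns. The helper name, the cutoff and the default
    are illustrative assumptions, not charade features.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if encoding is None or confidence < min_confidence:
        return default
    return encoding


# Result dicts copied from the interpreter session above.
print(pick_encoding({'confidence': 1.0, 'encoding': 'ascii'}))
print(pick_encoding({'confidence': 0.5, 'encoding': 'windows-1252'}))
```

With this cutoff, the first call keeps the ascii guess and the second falls back to the default. A sensible threshold depends on your data, so treat the value here as a starting point.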

To convert a string to standard unicode, call encoding.convert(). This will detect the original encoding and decode the text from it, returning a unicode string.

>>> encoding.convert('ascii')
'ascii'
>>> encoding.convert('abcdé')
'abcdé'
>>> encoding.convert(bytes('\222\222\223\225', 'latin-1'))
'\u2019\u2019\u201c\u2022'
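
If you need UTF-8 bytes rather than a string, for example to write to a binary file, the decode-then-encode round trip can be sketched with the standard library alone. This assumes the source encoding is already known; to_utf8_bytes() is a hypothetical helper name, not a function in encoding.py:

```python
def to_utf8_bytes(raw, source_encoding):
    """Decode `raw` bytes from `source_encoding`, then encode as UTF-8."""
    text = raw.decode(source_encoding)   # bytes -> str
    return text.encode('utf-8')          # str -> UTF-8 bytes


# The same four bytes as bytes('\222\222\223\225', 'latin-1').
raw = b'\x92\x92\x93\x95'
print(to_utf8_bytes(raw, 'windows-1252'))
```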

How it works...

The detect() function is a wrapper around charade.detect() that can encode strings and handle UnicodeDecodeError exceptions. The charade.detect() function expects a bytes object, not a string, so when detect() is given a string, it encodes the string before trying to detect the encoding.

The convert() function first calls detect() to get the encoding, and then returns a decoded string.
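
To make the detect-then-decode flow concrete without installing anything, here is a much cruder stand-in that simply tries a fixed list of encodings in order. It is not how charade works (charade uses statistical detection), and the function name decode_with_fallback() and the candidate list are assumptions made for this sketch:

```python
# Candidate encodings, tried from most to least strict. The list is an
# illustrative assumption; latin-1 accepts any byte, so it acts as a last resort.
CANDIDATES = ('ascii', 'utf-8', 'windows-1252', 'latin-1')


def decode_with_fallback(data, candidates=CANDIDATES):
    """Return (text, encoding) using the first candidate that decodes."""
    if isinstance(data, str):
        data = data.encode()  # like convert(), always work on bytes
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


text, used = decode_with_fallback(b'\x92\x92\x93\x95')
print(used)
```

Trial decoding only tells you that the bytes are valid in an encoding, not that the result is the intended text, which is why a detector that reports a confidence value is usually the better tool.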

There's more...

The comment at the top of the module, # -*- coding: utf-8 -*-, tells the Python interpreter which encoding the source file itself is written in. This is helpful when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.
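
You can check which encoding Python would pick up from such a comment by using tokenize.detect_encoding() from the standard library. The wrapper name declared_encoding() and the sample source lines below are assumptions made for this sketch:

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to read `source_bytes`."""
    # detect_encoding() takes a readline callable and returns
    # (encoding, lines_read); only the encoding is needed here.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


print(declared_encoding(b"# -*- coding: utf-8 -*-\nimport charade\n"))
print(declared_encoding(b"# -*- coding: latin-1 -*-\nname = 'x'\n"))
```

Note that the returned name is normalized, so a latin-1 declaration is reported as iso-8859-1, and a file with no declaration is reported as utf-8.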

Converting to ASCII

If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'abcd\xe9').encode('ascii', 'ignore')
b'abcde'

Specifying 'NFKD' as the first argument decomposes each character into its compatibility form, so an accented letter such as é becomes a plain e followed by a separate combining accent. The final call to encode() with 'ignore' as the second argument then drops everything that is not ASCII, including the combining accent and any character that has no ASCII equivalent. This returns a bytes object, which you can call decode() on to get a string.
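
To see the two steps separately, the sketch below inspects the normalized string before encoding it. The to_ascii() helper is a hypothetical convenience wrapper, not something provided by unicodedata:

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only str; characters with no ASCII form are dropped."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


decomposed = unicodedata.normalize('NFKD', 'abcd\xe9')
# The single accented character has become a base letter plus a mark.
print(len('abcd\xe9'), len(decomposed))
# Name of the trailing code point that encode() will discard.
print(unicodedata.name(decomposed[-1]))
print(to_ascii('abcd\xe9'))
```

Characters that have no decomposition, such as ß, are removed entirely rather than replaced, so check the output if losing them matters for your text.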

UnicodeDammit conversion

The BeautifulSoup library contains a helper class called UnicodeDammit, which can do automatic conversion to unicode. Its usage is very simple:

>>> from bs4 import UnicodeDammit
>>> UnicodeDammit('abcd\xe9').unicode_markup
'abcdé'

Installing BeautifulSoup is covered in the previous recipe, Converting HTML entities with BeautifulSoup.
