Detecting and converting character encodings

Text processing often turns up text in a nonstandard character encoding. Ideally, all text would be ASCII or utf-8, but that's just not the reality. When you have text that is neither ASCII nor utf-8 and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before doing further processing.

Getting ready

You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.

How to do it...

Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict containing two keys: confidence and encoding. The confidence value is charade's estimate, between 0 and 1, of how likely it is that the value for encoding is correct.

# -*- coding: utf-8 -*-
import charade

def detect(s):
  # charade.detect() expects bytes, so encode str input first.
  try:
    if isinstance(s, str):
      return charade.detect(s.encode())
    else:
      return charade.detect(s)
  except UnicodeDecodeError:
    return charade.detect(s.encode('utf-8'))

def convert(s):
  # Work on bytes so the detected encoding can be used to decode them.
  if isinstance(s, str):
    s = s.encode()

  # Fall back to utf-8 if no encoding could be guessed (encoding is None).
  encoding = detect(s)['encoding'] or 'utf-8'

  return s.decode(encoding)

And here's some example code using detect() to determine character encoding:

>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect('abcdé')
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> encoding.detect(bytes('\222\222\223\225', 'latin-1'))
{'confidence': 0.5, 'encoding': 'windows-1252'}
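
Because detection is a guess, it can help to check the confidence value before trusting the encoding. The following is a minimal sketch and not part of encoding.py: the helper name choose_encoding, the 0.6 threshold, and the utf-8 default are all assumptions chosen for illustration. It takes a result dict shaped like the ones shown above, so it does not need charade to run.

```python
def choose_encoding(result, min_confidence=0.6, default="utf-8"):
    """Return the detected encoding only if the detector is confident enough.

    `result` is a dict with 'encoding' and 'confidence' keys. The threshold
    and the default are illustrative assumptions, not values recommended
    by charade.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return default


# Result dicts copied from the detect() examples.
print(choose_encoding({"confidence": 1.0, "encoding": "ascii"}))         # ascii
print(choose_encoding({"confidence": 0.5, "encoding": "windows-1252"}))  # utf-8
```

Here a low-confidence result simply falls back to the default. Depending on your data, you might prefer to log or skip such text instead.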

To convert text of unknown encoding to a standard unicode string, call encoding.convert(). This will detect the original encoding and decode the text from it, returning a Python str that you can then encode as utf-8 if you need bytes.

>>> encoding.convert('ascii')
'ascii'
>>> encoding.convert('abcdé')
'abcdé'
>>> encoding.convert(bytes('\222\222\223\225', 'latin-1'))
'\u2019\u2019\u201c\u2022'

How it works...

The detect() function is a wrapper around charade.detect() that can encode strings and handle UnicodeDecodeError exceptions. The charade.detect() method expects a bytes object, not a string, so in these cases, the string is encoded before trying to detect the encoding.

The convert() function first calls detect() to get the encoding, and then returns a decoded string.
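
If charade is unavailable, or you already have a short list of likely encodings, a different technique is to try each candidate in order and keep the first one that decodes cleanly. The sketch below is not how encoding.py works and does not use charade. The function name decode_with_candidates and its default candidate list are hypothetical, and the order of candidates matters, because a permissive encoding will accept bytes that were really written in another one.

```python
def decode_with_candidates(data, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes data."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    # latin-1 maps every byte to a code point, so this decode cannot fail.
    return data.decode("latin-1"), "latin-1"


text, used = decode_with_candidates(bytes('\222\222\223\225', 'latin-1'))
print(used)  # windows-1252
```

Strict encodings such as utf-8 should come first, since they reject most byte sequences that were not written as utf-8.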

There's more...

The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint to the Python interpreter that tells which encoding to use for the strings in the code. This is helpful for when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.
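
To see which encoding Python would pick for a given source file, the standard library's tokenize.detect_encoding() function applies the same rules. The sketch below is only an illustration: the wrapper name source_encoding is an assumption, and the source snippets are made up for the example.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    # detect_encoding() reads at most two lines, looking for a BOM or a
    # PEP 263 coding comment, and defaults to utf-8 when it finds neither.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


print(source_encoding(b"# -*- coding: utf-8 -*-\nimport charade\n"))  # utf-8
print(source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))         # iso-8859-1
print(source_encoding(b"x = 1\n"))                                    # utf-8
```

Note that the returned name is normalized, so latin-1 is reported as iso-8859-1.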

Converting to ASCII

If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents or dropped if there is no equivalent character, then you can use the unicodedata.normalize() function:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'abcd\xe9').encode('ascii', 'ignore')
b'abcde'

Specifying 'NFKD' as the first argument decomposes each character into its compatibility form, so an accented letter such as é becomes a plain e followed by a separate combining accent. The final call to encode() with 'ignore' as the second argument then drops every character that has no ASCII representation, including those combining accents. This returns a bytes object, which you can call decode() on to get a string.
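
The two steps can be wrapped in a small helper that returns a str directly. This is a minimal sketch: the function name to_ascii is an assumption, and the sample strings are made up to show both characters that have an ASCII equivalent and characters that are simply dropped.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only str approximation of text."""
    # NFKD splits characters such as 'é' into a base letter plus a combining
    # mark, and expands compatibility characters such as the 'fi' ligature.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII (combining marks, curly quotes) is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("abcd\xe9"))       # abcde
print(to_ascii("\ufb01anc\xe9"))  # fiance
print(to_ascii("it\u2019s"))      # its
```

The last example shows the cost of this approach: the curly apostrophe has no decomposition, so it disappears rather than becoming a plain apostrophe.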

UnicodeDammit conversion

The BeautifulSoup library contains a helper class called UnicodeDammit, which can do automatic conversion to unicode. Its usage is very simple:

>>> from bs4 import UnicodeDammit
>>> UnicodeDammit('abcd\xe9').unicode_markup
'abcdé'

Installing BeautifulSoup is covered in the previous recipe, Converting HTML entities with BeautifulSoup.

See also

Encoding detection and conversion is a recommended first step before doing HTML processing with lxml or BeautifulSoup, covered in the Extracting URLs from HTML with lxml and Converting HTML entities with BeautifulSoup recipes.
