A common problem in text processing is encountering text with a nonstandard character encoding. Ideally, all text would be ASCII or utf-8, but that's just not the reality. When you have non-ASCII or non-utf-8 text and you don't know what the character encoding is, you'll need to detect it and convert the text to a standard encoding before doing further processing.
You'll need to install the charade module using sudo pip install charade or sudo easy_install charade. You can learn more about charade at https://pypi.python.org/pypi/charade.
Encoding detection and conversion functions are provided in encoding.py. These are simple wrapper functions around the charade module. To detect the encoding of a string, call encoding.detect(string). You'll get back a dict containing two attributes: confidence and encoding. The confidence attribute is a probability of how confident charade is that the value for encoding is correct.
# -*- coding: utf-8 -*-
import charade

def detect(s):
    try:
        if isinstance(s, str):
            return charade.detect(s.encode())
        else:
            return charade.detect(s)
    except UnicodeDecodeError:
        return charade.detect(s.encode('utf-8'))

def convert(s):
    if isinstance(s, str):
        s = s.encode()

    encoding = detect(s)['encoding']

    if encoding == 'utf-8':
        return s.decode()
    else:
        return s.decode(encoding)
And here's some example code using detect() to determine character encoding:
>>> import encoding
>>> encoding.detect('ascii')
{'confidence': 1.0, 'encoding': 'ascii'}
>>> encoding.detect('abcdé')
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> encoding.detect(bytes('\222\222\223\225', 'latin-1'))
{'confidence': 0.5, 'encoding': 'windows-1252'}
To convert a string to a standard unicode encoding, call encoding.convert(string). This will decode the string from its original encoding and return a unicode string.
>>> encoding.convert('ascii')
'ascii'
>>> encoding.convert('abcdé')
'abcdé'
>>> encoding.convert(bytes('\222\222\223\225', 'latin-1'))
'\u2019\u2019\u201c\u2022'
The detect() function is a wrapper around charade.detect() that can handle both strings and bytes, catching UnicodeDecodeError exceptions. The charade.detect() function expects a bytes object, not a string, so a string argument is encoded before trying to detect its encoding.
The convert() function first encodes the string if necessary, calls detect() to get the encoding, and then returns a decoded string.
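The decoding step can be tried on its own with only the standard library. As a minimal sketch, assuming the encoding has already been detected as windows-1252, the bytes from the earlier example decode directly:

```python
# The same windows-1252 punctuation bytes used in the examples above,
# written with octal escapes (0x92, 0x92, 0x93, 0x95).
data = b'\222\222\223\225'

# Once the encoding is known (here, detected as windows-1252),
# decoding is a single call on the bytes object.
text = data.decode('windows-1252')
print(text)  # '\u2019\u2019\u201c\u2022' (curly quotes and a bullet)
```

If the detected encoding is wrong, decode() may raise UnicodeDecodeError, which is one reason the confidence value returned by detect() is worth checking.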
The comment at the top of the module, # -*- coding: utf-8 -*-, is a hint that tells the Python interpreter which encoding to use for the strings in the code. This is helpful when you have non-ASCII strings in your source code, and is documented in detail at http://www.python.org/dev/peps/pep-0263/.
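Note that Python 3 assumes utf-8 source encoding by default, so the declaration mainly matters when the file uses some other encoding, but it makes the choice explicit. As a minimal illustration (a hypothetical standalone module, not part of encoding.py):

```python
# -*- coding: utf-8 -*-
# The declaration above tells the interpreter to read this file's
# bytes as UTF-8, so the non-ASCII literal below parses correctly.
greeting = 'abcdé'
print(greeting)  # abcdé
```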
If you want pure ASCII text, with non-ASCII characters converted to ASCII equivalents or dropped if there is no equivalent character, then you can use the unicodedata.normalize()
function:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'abcd\xe9').encode('ascii', 'ignore')
b'abcde'
Specifying 'NFKD' as the first argument decomposes the non-ASCII characters into their base characters and combining marks, and the final call to encode() with 'ignore' as the second argument drops any characters that have no ASCII equivalent. This returns a bytes object, which you can call decode() on to get a string.
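Putting those steps together, the full round trip from accented text to a plain ASCII str can be sketched as:

```python
import unicodedata

# NFKD splits 'é' into 'e' plus a combining acute accent, and encoding
# to ASCII with errors='ignore' then drops the combining mark.
normalized = unicodedata.normalize('NFKD', 'abcdé')
ascii_bytes = normalized.encode('ascii', 'ignore')  # b'abcde'
ascii_text = ascii_bytes.decode('ascii')            # back to a str
print(ascii_text)  # abcde
```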
The BeautifulSoup library contains a helper class called UnicodeDammit, which can do automatic conversion to unicode. Its usage is very simple:
>>> from bs4 import UnicodeDammit
>>> UnicodeDammit('abcd\xe9').unicode_markup
'abcdé'
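UnicodeDammit also accepts raw bytes, and you can pass a list of candidate encodings to try in order; this sketch follows the example in the Beautiful Soup documentation:

```python
from bs4 import UnicodeDammit

# Raw latin-1 bytes plus a list of encodings for UnicodeDammit to try.
dammit = UnicodeDammit(b'Sacr\xe9 bleu!', ['latin-1', 'iso-8859-1'])
print(dammit.unicode_markup)     # Sacré bleu!
print(dammit.original_encoding)  # latin-1
```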
Installing BeautifulSoup is covered in the previous recipe, Converting HTML entities with BeautifulSoup.