Credit: Paul Prescod
You have XML documents that may use a large variety of Unicode encodings, and you need to find out which encoding each document is using.
This is a task that we need to code ourselves, rather than getting an existing package to perform it, if we want complete generality:
import codecs

""" Caller will hand this library a buffer and ask it to convert it
or autodetect the type. """

# None represents a potentially variable byte. "##" in the XML spec...
autodetect_dict = {
    # byte pattern          : encoding name
    (0x00, 0x00, 0xFE, 0xFF): "ucs4_be",
    (0xFF, 0xFE, 0x00, 0x00): "ucs4_le",
    (0xFE, 0xFF, None, None): "utf_16_be",
    (0xFF, 0xFE, None, None): "utf_16_le",
    (0x00, 0x3C, 0x00, 0x3F): "utf_16_be",
    (0x3C, 0x00, 0x3F, 0x00): "utf_16_le",
    (0x3C, 0x3F, 0x78, 0x6D): "utf_8",
    (0x4C, 0x6F, 0xA7, 0x94): "EBCDIC",
}

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
    The buffer should be at least four bytes long.
    Returns utf_8, the XML default, if nothing more specific
    can be detected.
    Note that encoding_name might not have an installed
    decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain

    encoding = "utf_8"  # According to the XML spec, this is the default
    # This code successively tries to refine the default:
    # Whenever it fails to refine, it falls back to
    # the last place encoding was set

    # bytearray yields integer byte values for both a Python 2 str
    # and a Python 3 bytes object
    prefix = byte1, byte2, byte3, byte4 = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(prefix, None)
    if not enc_info:
        # Try autodetection again, removing potentially
        # variable bytes
        prefix = byte1, byte2, None, None
        enc_info = autodetect_dict.get(prefix)
    if enc_info:
        encoding = enc_info  # We have a guess...these are
                             # the new defaults

    # Try to find a more precise encoding using XML declaration
    try:
        secret_decoder_ring = codecs.lookup(encoding)[1]
    except LookupError:
        # No decoder installed (e.g., EBCDIC): return the guess as is
        return encoding
    decoded, length = secret_decoder_ring(buffer)
    first_line = decoded.split("\n", 1)[0]
    if first_line and first_line.startswith(u"<?xml"):
        encoding_pos = first_line.find(u"encoding")
        if encoding_pos != -1:
            # Look for double quotes
            quote_pos = first_line.find('"', encoding_pos)
            if quote_pos == -1:
                # Look for single quote
                quote_pos = first_line.find("'", encoding_pos)
            if quote_pos > -1:
                quote_char = first_line[quote_pos]
                rest = first_line[quote_pos+1:]
                encoding = rest[:rest.find(quote_char)]
    return encoding
The XML specification describes the outlines of an algorithm for detecting the Unicode encoding that an XML document uses. This recipe implements this algorithm and helps your XML processing programs find out which encoding is being used by a specific document.
The default encoding (unless we can determine another one specifically) must be UTF-8, as this is part of the specifications that define XML. Certain byte patterns in the first four, or sometimes even just the first two, bytes of the text can let us identify a different encoding. For example, if the text starts with the 2 bytes 0xFF, 0xFE we can be certain this is a byte-order mark that identifies the encoding type as little-endian (low byte before high byte in each character) and the encoding itself as UTF-16 (or the 32-bits-per-character UCS-4 if the next 2 bytes in the text are 0, 0).
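The byte-order-mark check described above can also be sketched with the BOM constants that the standard codecs module provides. This is only an illustration, not part of the recipe: the names BOM_TABLE and sniff_bom are hypothetical, and the sketch assumes that Python's utf_32 codec names are an acceptable stand-in for what the recipe calls UCS-4.

```python
import codecs

# Longer signatures must come first: the 4-byte little-endian BOM
# (FF FE 00 00) begins with the 2-byte UTF-16 little-endian BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

Checking the four-byte marks before the two-byte ones is what keeps a 32-bit little-endian document from being reported as UTF-16.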
If we get as far as this, we must also examine the first line of the text by decoding the text from a byte string into Unicode with the encoding determined so far, and detecting the first line-end '\n' character. If the first line begins with u'<?xml', it’s an XML declaration and may explicitly specify an encoding by using the keyword encoding as an attribute. The nested if statements in the recipe check for that, and, if they find an encoding thus specified, the recipe returns it as the encoding it has determined. This step is absolutely crucial, since any text starting with the single-byte ASCII-like representation of the XML declaration, <?xml, would be otherwise erroneously identified as encoded in UTF-8, while its explicit encoding attribute may specify it as being, for example, one of the ISO-8859 standard encodings.
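As an alternative to nested string searches, the declaration check can be sketched with a regular expression. This is a minimal sketch rather than the recipe's own method: the names _DECL_RE and declared_encoding are hypothetical, and the function assumes its input has already been decoded to Unicode text with the provisional encoding.

```python
import re

# Matches the encoding attribute of an XML declaration with either
# quote style, e.g. <?xml version="1.0" encoding='iso-8859-1'?>
_DECL_RE = re.compile(
    r"""^<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._\-]*)\1"""
)

def declared_encoding(text):
    """Return the encoding named in the XML declaration, or None."""
    # A decoded byte-order mark may precede the declaration; skip it.
    match = _DECL_RE.match(text.lstrip(u"\ufeff"))
    return match.group(2) if match else None
```

Because the pattern cannot cross the closing > of the declaration, an attribute named encoding on a later element is not mistaken for the document's encoding.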
This code detects a variety of encodings, including some that are not yet supported by Python’s Unicode decoders. So the fact that you can decipher the encoding does not guarantee that you can decipher the document itself!
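Whether a detected name can actually be decoded is easy to test before trying to read the document. The helper below is a hypothetical sketch (has_decoder is not part of the recipe); it relies only on codecs.lookup, which raises LookupError for a name with no registered codec.

```python
import codecs

def has_decoder(encoding_name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        return False
    return True
```

A caller might use this to decide between decoding the document directly and reporting the detected name to the user, for example when the recipe returns a generic label such as EBCDIC rather than a specific code page like cp500.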
See also: Recipe 3.18 and Recipe 3.19. Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/.