There are two general-purpose libraries for detecting unknown encodings:
- chardet, originally distributed as part of the Universal Feed Parser
- UnicodeDammit, part of Beautiful Soup

chardet is supposed to be a port of Mozilla's character-detection code, the same approach Firefox uses.
You can use the following regex to detect UTF-8 in byte strings:
import re

# Bytes pattern (rb"..."), since the input is a byte string, not text
utf8_detector = re.compile(rb"""^(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$""", re.X)
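As a quick sanity check, here is the same pattern recompiled in compact form so the snippet stands alone (in Python 3 it must be a bytes pattern, since the input is raw bytes); the sample strings are hypothetical:

```python
import re

# Same UTF-8 validity pattern as above, as a single bytes regex
utf8_detector = re.compile(
    rb"^(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"
)

print(bool(utf8_detector.match("café".encode("utf-8"))))    # True: C3 A9 is a valid 2-byte sequence
print(bool(utf8_detector.match("café".encode("latin-1"))))  # False: lone 0xE9 byte
```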
In practice, if you're dealing with English text, I've found the following works 99.9% of the time:
- if it matches the regex above, it's ASCII or UTF-8
- if it contains any bytes in 0x80-0x9F but no 0xA4, it's Windows-1252 (that range holds C1 control characters in the ISO-8859 family but printable punctuation in Windows-1252)
- if it contains 0xA4, assume ISO-8859-15 (Latin-9, which puts the euro sign at 0xA4)
- otherwise assume ISO-8859-1 (Latin-1)
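The steps above can be sketched as a small function; this is a minimal sketch of the heuristic, and the name `guess_encoding` is my own, not from any library:

```python
import re

# UTF-8 validity pattern from earlier in the answer, as a bytes regex
UTF8_PATTERN = re.compile(rb"""^(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$""", re.X)

def guess_encoding(data: bytes) -> str:
    """Apply the four heuristic rules in order."""
    if UTF8_PATTERN.match(data):
        return "utf-8"          # also covers pure ASCII
    if any(0x80 <= b <= 0x9F for b in data) and 0xA4 not in data:
        return "windows-1252"   # C1 range is printable punctuation here
    if 0xA4 in data:
        return "iso-8859-15"    # 0xA4 is the euro sign in Latin-9
    return "iso-8859-1"
```

For example, `guess_encoding("café €5".encode("windows-1252"))` returns `"windows-1252"` because the euro sign encodes as 0x80 there, while the same text in UTF-8 passes the regex and returns `"utf-8"`.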