There are two general-purpose libraries for detecting unknown encodings:
- chardet, originally distributed as part of the Universal Feed Parser
- UnicodeDammit, part of Beautiful Soup

chardet is supposed to be a port of Mozilla's character-detection code, the same approach Firefox uses.
You can use the following regex to detect UTF-8 in byte strings:
import re

# Bytes pattern (rb"..."), since the input is a byte string, not text
utf8_detector = re.compile(rb"""^(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$""", re.X)
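As a quick sanity check, here is the same pattern recompiled in compact form so the snippet stands alone (in Python 3 it must be a bytes pattern, since the input is raw bytes); the sample strings are hypothetical:

```python
import re

# Same UTF-8 validity pattern as above, as a single bytes regex
utf8_detector = re.compile(
    rb"^(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"
)

print(bool(utf8_detector.match("café".encode("utf-8"))))    # True: C3 A9 is a valid 2-byte sequence
print(bool(utf8_detector.match("café".encode("latin-1"))))  # False: lone 0xE9 byte
```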
In practice, if you're dealing with English text, I've found the following works 99.9% of the time:
- if it matches the regex above, it's ASCII or UTF-8
- if it contains any bytes in 0x80-0x9F but no 0xA4, it's Windows-1252 (that range holds C1 control characters in the ISO-8859 family but printable punctuation in Windows-1252)
- if it contains 0xA4, assume ISO-8859-15 (Latin-9, which puts the euro sign at 0xA4)
- otherwise assume ISO-8859-1 (Latin-1)
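The steps above can be sketched as a small function; this is a minimal sketch of the heuristic, and the name `guess_encoding` is my own, not from any library:

```python
import re

# UTF-8 validity pattern from earlier in the answer, as a bytes regex
UTF8_PATTERN = re.compile(rb"""^(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$""", re.X)

def guess_encoding(data: bytes) -> str:
    """Apply the four heuristic rules in order."""
    if UTF8_PATTERN.match(data):
        return "utf-8"          # also covers pure ASCII
    if any(0x80 <= b <= 0x9F for b in data) and 0xA4 not in data:
        return "windows-1252"   # C1 range is printable punctuation here
    if 0xA4 in data:
        return "iso-8859-15"    # 0xA4 is the euro sign in Latin-9
    return "iso-8859-1"
```

For example, `guess_encoding("café €5".encode("windows-1252"))` returns `"windows-1252"` because the euro sign encodes as 0x80 there, while the same text in UTF-8 passes the regex and returns `"utf-8"`.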