Question
I have a huge set of web pages in different encodings, and I'm trying to parse them using Beautiful Soup.
As far as I can tell, BS detects the encoding from the meta charset or XML encoding declaration. But some documents have no such declaration, or have a typo in the charset name, and BS fails on all of them. I suppose its default guess is UTF-8, which is wrong here. Luckily, all such pages (or nearly all of them) share the same encoding. Is there any way to set it as the default?
I've also tried grepping the charset and converting to UTF-8 with iconv first. That works nicely and produces perfectly readable UTF-8 output, but BeautifulSoup(sys.stdin.read()) sometimes (rarely, about 0.05% of all files) fails on it with
UnicodeDecodeError: 'utf8' codec can't decode byte *** in position ***: invalid start byte
The basic reason, to my mind, is that while the actual encoding is already UTF-8, the meta tags still declare the previous one, so BS gets confused. Its behavior here is really strange: parsing works smoothly when I delete some random character (like '-' or '*', nothing exotic). So I gave up on that route, and I'd really prefer to stick with Beautiful Soup's native decoding, which is also a bit faster.
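A minimal sketch of that pre-conversion route, for reference (the charset regex and the 'windows-1251' fallback are placeholder assumptions, not the real values):

import re
import sys
from bs4 import BeautifulSoup

raw = sys.stdin.buffer.read()                        # the page as raw bytes
m = re.search(rb'charset=["\']?([\w-]+)', raw)       # crude stand-in for the grep step
declared = m.group(1).decode('ascii', 'ignore') if m else 'windows-1251'  # placeholder fallback
text = raw.decode(declared, errors='replace')        # iconv-like conversion to a str
soup = BeautifulSoup(text)                           # BS now receives already-decoded text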
Answer 1:
BeautifulSoup will indeed make an educated guess using a character-detection library. That process can be wrong, and removing just one character can indeed radically change the outcome for certain types of documents.
You can override this guess by specifying an input codec:
soup = BeautifulSoup(source, from_encoding=codec)
You could use exception handling here to only apply the manual codec when decoding failed:
try:
    soup = BeautifulSoup(source)
except UnicodeDecodeError:
    soup = BeautifulSoup(source, from_encoding=codec)
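Put together, a self-contained sketch of that fallback might look like this (the 'windows-1251' codec and the file name are placeholders for your own values):

from bs4 import BeautifulSoup

FALLBACK_CODEC = 'windows-1251'     # placeholder: the encoding shared by the problem pages

def parse_page(source):
    # Let Beautiful Soup make its own guess first; only force the
    # known codec when that guess fails to decode the document.
    try:
        return BeautifulSoup(source)
    except UnicodeDecodeError:
        return BeautifulSoup(source, from_encoding=FALLBACK_CODEC)

with open('page.html', 'rb') as f:  # hypothetical file name
    soup = parse_page(f.read())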
Also see the Encodings section of the BeautifulSoup documentation.
Source: https://stackoverflow.com/questions/29255200/beautiful-soup-default-decode-charset