Beautiful Soup default decode charset?

旧时模样 posted on 2021-02-05 08:44:07

Question


I have a huge set of web pages with different encodings, and I am trying to parse them using Beautiful Soup.

As far as I can tell, BS detects the encoding from the meta charset or XML encoding declarations. But some documents have no such declaration, or have a typo in the charset name, and BS fails on all of them. I suppose its default guess is utf-8, which is wrong. Luckily, all such pages (or nearly all of them) have the same encoding. Is there any way to set it as the default?

I have also tried grepping the charset and converting to UTF-8 with iconv first. That works nicely and produces perfectly readable UTF-8 output, but BeautifulSoup(sys.stdin.read()) still fails on a small fraction of the files (about 0.05%) with

UnicodeDecodeError: 'utf8' codec can't decode byte *** in position ***: invalid start byte

To my mind, the basic reason is that although the actual encoding is now UTF-8, the meta tags still declare the previous one, so BS gets confused. The behavior is really strange: parsing succeeds when I delete one or another random character (an ordinary one like '-' or '*', nothing exotic). So I gave up on this approach, and I would really like to go back to native Beautiful Soup decoding, which is also a bit faster.


Answer 1:


BeautifulSoup does indeed make an educated guess, using a character detection library. That guess can be wrong, and removing just one character can radically change the outcome for certain types of documents.

You can override this guess by specifying an input codec:

soup = BeautifulSoup(source, from_encoding=codec)
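
For documents whose declared charset is missing or misspelled, one way to choose codec is to check the declared name against Python's codec registry and fall back to a known default. The following is a minimal sketch and not part of the original answer; pick_codec, DEFAULT_CODEC, and the windows-1251 default are hypothetical names and values chosen for illustration.

```python
import codecs
import re

# Assumption: the encoding that most of the untagged pages share.
DEFAULT_CODEC = "windows-1251"

# Matches charset=... in a <meta> tag or encoding="..." in an XML declaration.
_DECLARED = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I
)


def pick_codec(raw: bytes, default: str = DEFAULT_CODEC) -> str:
    """Return the declared codec if Python knows it, else the default."""
    match = _DECLARED.search(raw[:2048])  # declarations sit near the top
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name  # normalized codec name
        except LookupError:
            pass  # misspelled or unknown charset name
    return default
```

The result could then be passed along as from_encoding=pick_codec(source), assuming source holds the raw bytes of the page.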

You could use exception handling here to only apply the manual codec when decoding failed:

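from bs4 import BeautifulSoup

try:
    # Let Beautiful Soup detect the encoding on its own first.
    soup = BeautifulSoup(source)
except UnicodeDecodeError:
    # Detection failed; fall back to the known codec.
    soup = BeautifulSoup(source, from_encoding=codec)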

Also see the Encodings section of the BeautifulSoup documentation.
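
If nearly all of the untagged pages share one encoding, another option is to decode the bytes yourself and hand Beautiful Soup a str, in which case it has no encoding left to detect. This is a minimal sketch under that assumption, not something taken from the documentation; decode_with_default and the windows-1251 default are hypothetical and only illustrate the idea.

```python
def decode_with_default(raw: bytes, default: str = "windows-1251") -> str:
    """Decode as strict UTF-8 first, then fall back to the default codec."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps a stray bad byte from aborting the whole file.
        return raw.decode(default, errors="replace")
```

The returned string can be given to BeautifulSoup in place of the raw bytes. Note that strict UTF-8 is tried first only because text in a legacy single-byte encoding rarely happens to be valid UTF-8; that ordering is a heuristic, not a guarantee.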



Source: https://stackoverflow.com/questions/29255200/beautiful-soup-default-decode-charset
