Python - BeautifulSoup HTML parsing handles GBK encoding poorly - Chinese web-scraping problem

Backend · Unresolved · 1 answer · 1468 views
梦毁少年i · 2021-01-15 13:21

I have been tinkering with the following script:

# -*- coding: utf-8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit

1 Answer
  • 2021-01-15 13:39

    The file's meta tag claims that the character set is GB2312, but the data contains a character that exists only in the newer GBK/GB18030, and this is what trips BeautifulSoup up:

    simon@lucifer:~$ python
    Python 2.7 (r27:82508, Jul  3 2010, 21:12:11) 
    [GCC 4.0.1 (Apple Inc. build 5493)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib2
    >>> data = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
    >>> data.decode("gb2312")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 20148-20149: illegal multibyte sequence
    

    At this point, UnicodeDammit bails out: it falls back to chardet, then UTF-8, and finally Windows-1252, which always succeeds. By the looks of it, that Windows-1252 mis-decode is what you got.
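    The subset relationship between the encodings (GB2312 ⊂ GBK ⊂ GB18030) is exactly what this error demonstrates. A minimal Python 3 sketch, using 二噁英 ("dioxin"), the word that appears around byte offset 20140 in the page: the character 噁 (U+5641) exists in GBK/GB18030 but not in GB2312.

```python
# GB2312 is a subset of GBK, which is a subset of GB18030.
# 噁 (U+5641, in 二噁英 "dioxin") is a GBK/GB18030 character that
# GB2312 lacks, so a strict gb2312 decode of its bytes fails.
raw = "二噁英".encode("gb18030")

try:
    raw.decode("gb2312")
except UnicodeDecodeError as exc:
    print("gb2312 failed:", exc.reason)

# The superset codec decodes the same bytes without complaint.
print("gb18030:", raw.decode("gb18030"))
```

    This is the same failure mode as the traceback above, just reproduced on three characters instead of a whole page.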

    If we tell the decoder to substitute the bytes it cannot decode (the 'replace' error handler emits the U+FFFD replacement character), we can see which character is missing from GB2312:

    >>> print data[20140:20160].decode("gb2312", "replace")
    毒尾气二�英的排放难
    

    Using the correct encoding:

    >>> print data[20140:20160].decode("gb18030", "replace")
    毒尾气二噁英的排放难
    >>> from BeautifulSoup import BeautifulSoup
    >>> s = BeautifulSoup(data, fromEncoding="gb18030")
    >>> print s.findAll("p")[2].string[:10]
      信息通信技术是&
    

    Also:

    >>> print s.findAll("p")[2].string
      信息通信技术是“十二五”规划重点发展方向,行业具有很强的内在增长潜
    力,增速远高于GDP。软件外包、服务外包、管理软件、车载导航、网上购物、网络游戏、
    移动办公、移动网络游戏、网络视频等均存在很强的潜在需求,使信息技术行业继续保持较
    高增长。
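    On Python 3, where the BeautifulSoup 3 import above no longer exists, the same decode-first-then-parse idea can be sketched with the standard library alone. `ParaText` here is a hypothetical stand-in for `s.findAll("p")`, not a real API:

```python
from html.parser import HTMLParser

class ParaText(HTMLParser):
    """Collect the text of each <p> element (rough findAll("p") analogue)."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paras = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paras.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paras[-1] += data

# Decode with the superset codec first, then parse the unicode text.
raw = "<p>信息通信技术是“十二五”规划重点发展方向</p>".encode("gb18030")
parser = ParaText()
parser.feed(raw.decode("gb18030"))
print(parser.paras[0])
```

    The key point is unchanged from the answer: pick the encoding yourself instead of letting detection guess.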
    
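    More generally, because GB2312 and GBK are strict subsets of GB18030, a scraper can always widen a declared legacy Chinese charset before decoding. A sketch under that assumption; `decode_cn_html` is a hypothetical helper, not part of BeautifulSoup:

```python
import re

def decode_cn_html(raw):
    """Decode Chinese HTML bytes, widening legacy charset declarations.

    GB2312 and GBK are both subsets of GB18030, so when a page
    declares either of them it is always safe to decode as gb18030.
    (Hypothetical helper for illustration only.)
    """
    m = re.search(rb'charset=["\']?([\w-]+)', raw[:1024], re.I)
    declared = m.group(1).decode("ascii").lower() if m else "utf-8"
    if declared in ("gb2312", "gbk"):
        declared = "gb18030"
    return raw.decode(declared, "replace")

# A page that claims gb2312 but contains the GBK-only character 噁:
html = '<meta charset="gb2312"><p>二噁英</p>'.encode("gb18030")
print(decode_cn_html(html))
```

    With this, a mislabelled page like the one in the question decodes cleanly instead of falling through to Windows-1252.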