How to decode and encode web pages with Python?

慢半拍i 2021-01-07 06:19

I use BeautifulSoup and urllib2 to download web pages, but different web pages use different encodings, such as utf-8, gb2312, and gbk. I used urllib2 to get sohu's home page, w…

3 Answers
  • 2021-01-07 06:33

    Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:

    import urllib2
    from bs4 import BeautifulSoup
    
    html = urllib2.urlopen('http://www.sohu.com').read()
    soup = BeautifulSoup(html)
    
    >>> soup.original_encoding
    u'gbk'
    

    And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:

    <meta http-equiv="content-type" content="text/html; charset=GBK" />
    
    >>> soup.meta['content']
    u'text/html; charset=GBK'
    

    Now you can decode the HTML:

    decoded_html = html.decode(soup.original_encoding)
    

    but there's not much point, since the HTML is already available as unicode:

    >>> soup.a['title']
    u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
    >>> print soup.a['title']
    搜狐-中国最大的门户网站
    >>> soup.a.text
    u'\u641c\u72d0'
    >>> print soup.a.text
    搜狐
    

    It is also possible to detect the encoding with the chardet module (although it is a bit slow):

    >>> import chardet
    >>> chardet.detect(html)
    {'confidence': 0.99, 'encoding': 'GB2312'}
    
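    The answer above is Python 2 (urllib2, `print` statements). As a hedged Python 3 sketch of the same idea, the charset declared in the `<meta>` tag can be sniffed with the standard library alone; `sniff_charset` below is an illustrative helper, not part of bs4 or chardet, and the sample bytes simulate a downloaded GBK page rather than a real request to sohu.com:

```python
import re

def sniff_charset(raw, default='utf-8'):
    """Search the first 1024 bytes for a charset declaration."""
    head = raw[:1024].decode('ascii', errors='ignore')
    m = re.search(r'charset=["\']?([\w-]+)', head, re.I)
    return m.group(1) if m else default

# Simulated download: a tiny GBK-encoded page standing in for sohu.com.
html_bytes = '<meta charset="gbk"><title>搜狐</title>'.encode('gbk')

encoding = sniff_charset(html_bytes)   # 'gbk'
decoded = html_bytes.decode(encoding)  # a str with 搜狐 intact
```

    In practice you would still prefer bs4's original_encoding or chardet, which handle BOMs and more declaration styles than this one regex.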
  • 2021-01-07 06:33

    Another solution, using the simplified_scrapy package:

    from simplified_scrapy.request import req
    from simplified_scrapy.simplified_doc import SimplifiedDoc
    html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
    doc = SimplifiedDoc(html)
    print (doc.title.text)
    
  • 2021-01-07 06:45

    I know this is an old question, but I spent a while today puzzling over a particularly problematic website so I thought I'd share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html

    Requests has a feature that will automatically get the actual encoding of the website, meaning you don't have to wrestle with encoding/decoding it (before I found this, I was getting all sorts of errors trying to encode/decode strings/bytes and never getting any output which was readable). This feature is called apparent_encoding. Here's how it worked for me:

    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://url_youre_using_here.html'
    readOut = requests.get(url)
    readOut.encoding = readOut.apparent_encoding  # set the encoding properly before handing it off to BeautifulSoup
    soup = BeautifulSoup(readOut.text, "lxml")
    
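    To see why setting `encoding` from `apparent_encoding` matters, here is a stdlib-only illustration of the mojibake you get when GBK bytes are decoded with a wrong guess (no network or requests object involved):

```python
# Mojibake demo: the failure mode that apparent_encoding avoids.
raw = '搜狐'.encode('gbk')        # bytes as they arrive over the wire

wrong = raw.decode('latin-1')    # a wrong guess: every byte "decodes", badly
right = raw.decode('gbk')        # the correct encoding restores the text

print(wrong)   # ËÑºü  (mojibake)
print(right)   # 搜狐
```

    requests defaults to ISO-8859-1 (latin-1) when the Content-Type header gives no charset, which is exactly how output like the first line above appears.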