encoding problem in Python when urlopen() a gbk page

甜味超标 2021-01-07 13:25

My code here:

# coding:utf-8

if __name__ == '__main__':
    from urllib2 import urlopen
    url = 'http://iccna.blog.sohu.com/164572951.html'
    data =


        
1 answer
  •  生来不讨喜
    2021-01-07 13:51

    The problem is that the server returns the page gzip-compressed, so the bytes you read have to be decompressed before they can be decoded as GBK. Try this:

    # -*- coding: utf-8 -*-
    from __future__ import print_function

    import gzip
    import StringIO
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'http://iccna.blog.sohu.com/164572951.html'
    response = urllib2.urlopen(url)
    data = response.read()  # raw bytes, still gzip-compressed
    # GzipFile needs a file-like object, so wrap the bytes in a StringIO buffer.
    data = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()  # decompressed HTML bytes
    # Tell BeautifulSoup that the page is GBK-encoded.
    soup = BeautifulSoup(html, fromEncoding='gbk')
    print(soup)
    

    The Chinese characters still look wrong on my system, but this should point you in the right direction.
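
As a rough illustration of the decompress-then-decode order used in the answer above, here is a minimal sketch. It assumes Python 3 (the answer's code is Python 2), and `decode_body` and `GZIP_MAGIC` are hypothetical names, not part of any library. Rather than assuming the body is always compressed, it checks for the two-byte gzip signature first, and it decodes with `errors="replace"` so that a stray invalid byte does not raise an exception.

```python
import gzip

# Every gzip stream starts with these two bytes (hypothetical constant name).
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, encoding="gbk"):
    """Return text from a response body that may or may not be gzip-compressed.

    `raw` is the bytes object read from the response; `encoding` is the
    page's character encoding (assumed to be GBK here, as in the question).
    """
    if raw[:2] == GZIP_MAGIC:
        # Compressed body: inflate it before attempting to decode.
        raw = gzip.decompress(raw)
    # Invalid byte sequences become U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")
```

In Python 3 the bytes would typically come from `urllib.request.urlopen(url).read()`. If the decoded text still looks garbled when printed, the terminal's own encoding is one possible cause, which this sketch does not address.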
