Why does urllib return garbage from some Wikipedia articles?

刺人心 2021-01-14 00:33
>>> import urllib2

>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'
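
The session as posted appears truncated. Judging from the answers below, which refer to req2 and to a body beginning with the bytes \x1f\x8B, it presumably continued along these lines (a reconstruction, not the original transcript):

>>> req1 = urllib2.Request(good_article)
>>> req2 = urllib2.Request(bad_article)
>>> urllib2.urlopen(req1).readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" ...'
>>> urllib2.urlopen(req2).readline()
'\x1f\x8b\x08\x00...'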


        
3 answers
  • 2021-01-14 00:39

    I think something else is causing your problem. That series of bytes looks like encoded content.

    import urllib2

    bad_article = 'http://en.wikipedia.org/wiki/India'
    req = urllib2.Request(bad_article)
    # Identify as a browser; some servers vary their response by User-Agent.
    req.add_header('User-Agent', 'Mozilla/5.0')
    result = urllib2.urlopen(req)
    print result.readline()
    

    resulted in this:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    

    which is correct.

  • 2021-01-14 00:39

    Do a "curl -i" for both the links. If its coming fine, there is no environment problem.

  • 2021-01-14 00:57

    It's not an environment, locale, or encoding problem. The offending stream of bytes is gzip-compressed: the \x1f\x8B at the start is the gzip magic number, exactly what a gzip stream produced with default settings begins with.

    It looks as though the server is ignoring the fact that you never called

    req2.add_header('Accept-encoding', 'gzip')

    You should look at result.headers.getheader('Content-Encoding') and, if necessary, decompress the body yourself.
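
    A minimal sketch of that fallback, sticking with the question's Python 2 / urllib2 setup (the variable names are illustrative):

    import gzip
    import urllib2
    from StringIO import StringIO

    bad_article = 'http://en.wikipedia.org/wiki/India'
    result = urllib2.urlopen(urllib2.Request(bad_article))
    body = result.read()

    # If the server gzip-compressed the body anyway, inflate it ourselves.
    if result.headers.getheader('Content-Encoding') == 'gzip':
        body = gzip.GzipFile(fileobj=StringIO(body)).read()

    print body[:80]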
