Question
Here's my code (you can test it yourselves). I always get garbled characters instead of the page source.
import urllib2

Header = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)"}
Req = urllib2.Request("http://rlslog.net", None, Header)
Response = urllib2.urlopen(Req)
Html = Response.read()
print Html[:1000]
Normally Html should contain the page source, but instead it's full of garbled characters. Does anybody know why?
BTW: I'm on Python 2.7.
Answer 1:
As Bruce already suggested, this looks like a compression problem. The server returns gzip-compressed content, but urllib2 does not decompress gzip automatically. In fact, as far as I know the server is misbehaving here: it should only compress the content when an Accept-Encoding: gzip header is present (which you either send yourself, or which is added automatically by a client that supports compression).
So either use a library that handles this for you, such as httplib2 (which I've tested against the page in question, and it works), or decompress the response yourself (see the answer to this SO question for how to do it; note that there the headers returned by the server are checked to see whether the content is gzip-compressed).
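To illustrate the "decompress yourself" route, here is a minimal sketch. The helper name decode_body and the simulated response bytes are mine, not from the answer; the commented urllib2 lines show where the real values would come from, under the assumption that the server sets a Content-Encoding: gzip header.

```python
import gzip
import io

def decode_body(body, content_encoding):
    """Return the page bytes, gunzipping only when the server
    says it sent gzip-compressed content."""
    if content_encoding == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body

# Simulated server response: gzip-compressed page source.
page = b"<html><body>page source</body></html>"
compressed = gzip.compress(page)

# With urllib2 you would obtain the real values roughly as:
#   body = Response.read()
#   content_encoding = Response.info().get("Content-Encoding")
print(decode_body(compressed, "gzip"))
```

The check on Content-Encoding matters: decompressing a body that was never compressed would raise an error.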
Answer 2:
You make your request with a user agent that supports on-the-fly compression. Are you sure the output is not gzip-compressed? Try running it through the zlib module and/or printing the response headers.
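A quick way to try the zlib route, assuming the body really is gzip data (the sample bytes here are made up for illustration):

```python
import gzip
import zlib

# Simulated gzip-compressed response body.
data = gzip.compress(b"<html>page source</html>")

# wbits = 16 + MAX_WBITS tells zlib to expect the gzip wrapper
# around the deflate stream, which plain zlib.decompress(data)
# would otherwise reject.
print(zlib.decompress(data, 16 + zlib.MAX_WBITS))
```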
Source: https://stackoverflow.com/questions/7231280/why-i-got-messy-characters-while-opening-url-using-urllib2