Question
I am trying to read some UTF-8 files from the addresses in the code below. It works for most of them, but for some files urllib2 (and urllib) cannot read the content.
The obvious answer here would be that the second file is corrupt, but the strange thing is that IE reads both of them with no problem at all. The code has been tested on both XP and Linux, with identical results. Any suggestions?
import urllib2
#This works:
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line=f.readline()
print "this works: %s)" %(line)
line=unicode(line,'utf-8') #... works fine
#This doesn't
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line=f.readline()
print "this doesn't: %s)" %(line)
line=unicode(line,'utf-8')#...causes an exception:
Answer 1:
>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}
Either set a header that prevents the site sending a gzip-encoded response, or decode it first.
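For example, a minimal sketch of the first option (Python 2; it assumes the server honours an Accept-Encoding request for an uncompressed response, which is not guaranteed):
import urllib2
url = "http://www.gutenberg.org/cache/epub/144/pg144.txt"
# Ask the server for an uncompressed ("identity") response instead of gzip.
request = urllib2.Request(url, headers={'Accept-Encoding': 'identity'})
f = urllib2.urlopen(request)
line = unicode(f.readline(), 'utf-8')  # should now decode without an exception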
Answer 2:
The URL you're asking for seems to refer to a private cache. Try http://www.gutenberg.org/files/144/144-0.txt instead (found at http://www.gutenberg.org/ebooks/144).
If you really want to use the /cache/ URL: the server is sending you gzipped data, not unicode. urllib2 does not ask for gzipped data and does not decode it, which is correct behavior.
See this question for how to uncompress it.
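One way to do the decompression yourself, sketched under the assumption that the response really is gzip-encoded as shown in Answer 1 (Python 2; gzip needs a seekable file object, hence the StringIO buffer):
import gzip
import urllib2
from StringIO import StringIO
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
data = f.read()
if f.headers.get('content-encoding') == 'gzip':
    # Decompress the body before decoding it as UTF-8.
    data = gzip.GzipFile(fileobj=StringIO(data)).read()
text = unicode(data, 'utf-8')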
Answer 3:
This isn't a direct solution, but you should look at the Requests library (http://pypi.python.org/pypi/requests). Even if you want to keep using urllib, you can read the Requests source code to understand how it handles UTF-8 strings.
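For comparison, a rough sketch using Requests (not the asker's original code): it transparently decompresses gzip responses and decodes the body using the charset declared in the Content-Type header.
import requests
r = requests.get("http://www.gutenberg.org/cache/epub/144/pg144.txt")
r.raise_for_status()  # fail loudly on HTTP errors
text = r.text         # already a unicode string; gzip is handled for you
print text[:80]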
Source: https://stackoverflow.com/questions/7964726/gibberish-from-urlopen