Gibberish from urlopen

Submitted by ♀尐吖头ヾ on 2019-12-10 22:15:10

Question


I am trying to read some UTF-8 files from the addresses in the code below. It works for most of them, but some files cannot be read with urllib2 (or urllib).

The obvious answer is that the second file is corrupt, but strangely IE reads both of them with no problem at all. The code has been tested on both XP and Linux, with identical results. Any suggestions?

import urllib2

# This works:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line = f.readline()
print "this works: %s" % line
line = unicode(line, 'utf-8')  # ... works fine

# This doesn't:
f = urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line = f.readline()
print "this doesn't: %s" % line
line = unicode(line, 'utf-8')  # ... causes an exception

Answer 1:


>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}

Either set a request header that prevents the site from sending a gzip-encoded response, or decompress the body before decoding it.
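Both options can be combined in one place. The following is a minimal sketch, not the answerer's code: it assumes Python 3 (where urllib2 became urllib.request), and the helper names decode_body and fetch_text are hypothetical. It asks for an uncompressed body via Accept-Encoding: identity, and still gunzips the body if the server reports Content-Encoding: gzip anyway.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding, charset="utf-8"):
    """Turn raw response bytes into text (hypothetical helper).

    If the Content-Encoding header value is "gzip", the bytes are
    decompressed before being decoded with the given charset.
    """
    if (content_encoding or "").strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url):
    """Fetch url and return its body as text (hypothetical helper).

    "identity" asks the server not to compress the body; the header
    check in decode_body covers servers that send gzip regardless.
    """
    request = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Calling fetch_text("http://www.gutenberg.org/cache/epub/144/pg144.txt") would then be expected to return readable text, provided the server still behaves as shown in the headers above. The charset is assumed to be UTF-8, as stated in the question.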




Answer 2:


The URL you're asking for seems to refer to a private cache. Try http://www.gutenberg.org/files/144/144-0.txt instead (found at http://www.gutenberg.org/ebooks/144).

If you really want to use the /cache/ URL: The server is sending you gzipped data, not unicode. urllib2 does not ask for gzipped data and doesn't decode it, which is correct behavior. See this question for how to uncompress it.
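To see why the unicode() call in the question fails, it helps to look at the first two bytes of the response. A gzip stream always starts with the bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, so decoding the raw body raises an error. The sketch below assumes Python 3; the names looks_gzipped and to_text are hypothetical and only illustrate the check.

```python
import gzip

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data):
    """Return True if data starts with the gzip magic bytes."""
    return data[:2] == GZIP_MAGIC


def to_text(data, charset="utf-8"):
    """Decompress data if it looks gzipped, then decode it to text."""
    if looks_gzipped(data):
        data = gzip.decompress(data)
    return data.decode(charset)


if __name__ == "__main__":
    compressed = gzip.compress("some book text".encode("utf-8"))
    try:
        compressed.decode("utf-8")  # what the question's code effectively does
    except UnicodeDecodeError as error:
        print("raw gzip bytes are not UTF-8:", error)
    print(to_text(compressed))
```

Sniffing the magic bytes is a fallback for when the response headers are not at hand; checking the Content-Encoding header is the more direct test when they are.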




Answer 3:


This is not a direct solution, but you should look at the Requests library (http://pypi.python.org/pypi/requests). Even if you still want to use urllib, you can read the source code of Requests to understand how it works with UTF-8 strings.



Source: https://stackoverflow.com/questions/7964726/gibberish-from-urlopen
