Urllib's urlopen breaking on some sites (e.g. StackApps API): returns garbage results


Question


I'm using urllib2's urlopen function to try and get a JSON result from the StackOverflow api.

The code I'm using:

>>> import urllib2
>>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
>>> conn.readline()

The result I'm getting:

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ\...

I'm fairly new to urllib, but this doesn't seem like the result I should be getting. I've tried it on other machines and I get what I expect (the same thing visiting the address in a browser gives me: a JSON object).

Using urlopen on other sites (e.g. "http://google.com") works fine and gives me actual HTML. I've also tried using urllib, and it gives the same result.

I'm pretty stuck, not even knowing where to look to solve this problem. Any ideas?


Answer 1:


That almost looks like something you would be feeding to pickle. Maybe something in the User-Agent string or Accept header that urllib2 is sending is causing StackOverflow to send something other than JSON.

One telltale sign is to look at conn.headers.headers to see what the Content-Type header says.
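
For instance, a quick interactive check (a sketch; getheader() pulls out a single header, and the exact values depend on the server and any proxies in between):

>>> import urllib2
>>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
>>> conn.headers.getheader('Content-Type')      # what the server says the body is
>>> conn.headers.getheader('Content-Encoding')  # 'gzip' here would explain the garbage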

And this question, Odd String Format Result from API Call, may have your answer. Basically, you might have to run your result through a gzip decompressor.
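
For example, one way to run the bytes through a gzip decompressor (a minimal sketch in Python 2, using the standard gzip and StringIO modules; it assumes the whole response fits in memory):

>>> import urllib2, gzip, StringIO
>>> conn = urllib2.urlopen("http://api.stackoverflow.com/0.8/users/")
>>> raw = conn.read()
>>> conn.close()
>>> # Wrap the compressed bytes in a file-like object and decompress them
>>> gzip.GzipFile(fileobj=StringIO.StringIO(raw)).read()[:25]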

Double-checking with this code:

>>> req = urllib2.Request("http://api.stackoverflow.com/0.8/users/",
...                       headers={'Accept-Encoding': 'gzip, identity'})
>>> conn = urllib2.urlopen(req)
>>> val = conn.read()
>>> conn.close()
>>> val[0:25]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ'

Yes, you are definitely getting gzip-encoded data back: the leading '\x1f\x8b' bytes are the gzip magic number.

Since you seem to be getting different results on different machines with the same version of Python, and in general it looks like the urllib2 API would require you to do something special to request gzip-encoded data, my guess is that you have a transparent proxy in there someplace.

I saw a presentation by the EFF at CodeCon in 2009. They were doing end-to-end connectivity testing to discover dirty ISP tricks of various kinds. One of the things they discovered while doing this testing is that a surprising number of consumer-level NAT routers add random HTTP headers or do transparent proxying. You might have some piece of equipment on your network that's adding or modifying the Accept-Encoding header in order to make your connection seem faster.
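
If so, a defensive workaround is to decompress whenever the response actually is gzipped, regardless of what you asked for. A sketch with a hypothetical fetch() helper (Python 2; also checking the '\x1f\x8b' magic number covers a proxy that mangles the Content-Encoding header too):

import urllib2, gzip, StringIO

def fetch(url):
    """Fetch a URL, transparently un-gzipping the body if necessary."""
    conn = urllib2.urlopen(url)
    body = conn.read()
    conn.close()
    # Trust either the response header or the gzip magic number
    if (conn.headers.getheader('Content-Encoding') == 'gzip'
            or body[:2] == '\x1f\x8b'):
        body = gzip.GzipFile(fileobj=StringIO.StringIO(body)).read()
    return body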



Source: https://stackoverflow.com/questions/3028426/urllibs-urlopen-breaking-on-some-sites-e-g-stackapps-api-returns-garbage-re
