Urllib's urlopen breaking on some sites (e.g. StackApps api): returns garbage results

Omnifarious

That almost looks like something you would be feeding to pickle. Maybe something in the User-Agent string or Accept header that urllib2 is sending is causing StackOverflow to send something other than JSON.

One telltale sign is to look at conn.headers.headers to see what the Content-Type header says.
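For example, assuming conn is the object urlopen returned (as in the question's code), a quick interactive check looks something like this:

>>> conn.headers.headers                    # raw header lines exactly as the server sent them
>>> conn.headers.gettype()                  # just the parsed Content-Type, e.g. 'application/json'
>>> conn.headers.get('Content-Encoding')    # 'gzip' here would explain the "garbage"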

And this question, Odd String Format Result from API Call, may have your answer. Basically, you might have to run your result through a gzip decompressor.

Double checking with this code:

>>> import urllib2
>>> req = urllib2.Request("http://api.stackoverflow.com/0.8/users/",
                          headers={'Accept-Encoding': 'gzip, identity'})
>>> conn = urllib2.urlopen(req)
>>> val = conn.read()
>>> conn.close()
>>> val[0:25]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x04\x00\xed\xbd\x07`\x1cI\x96%&/m\xca{\x7fJ'

Yes, you are definitely getting gzip encoded data back.
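If you want to see the real payload, the standard library can unpack it in place; a minimal sketch, continuing the session above:

>>> import zlib
>>> decoded = zlib.decompress(val, 16 + zlib.MAX_WBITS)   # 16 + MAX_WBITS tells zlib to expect a gzip header
>>> decoded[0:25]                                          # should now read as ordinary JSON text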

Since you seem to be getting different results on different machines with the same version of Python, and since in general the urllib2 API would require you to do something special to request gzip-encoded data, my guess is that you have a transparent proxy in there someplace.

I saw a presentation by the EFF at CodeCon in 2009. They were doing end-to-end connectivity testing to discover dirty ISP tricks of various kinds. One of the things they discovered while doing this testing was that a surprising number of consumer-level NAT routers add random HTTP headers or do transparent proxying. You might have some piece of equipment on your network that's adding or modifying the Accept-Encoding header in order to make your connection seem faster.
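If that is what is happening, you can at least make your own client robust to it: ask for an uncompressed body and decompress anyway whenever the gzip magic bytes show up. A rough defensive sketch (not from the original question):

>>> import urllib2, zlib
>>> req = urllib2.Request("http://api.stackoverflow.com/0.8/users/",
                          headers={'Accept-Encoding': 'identity'})
>>> val = urllib2.urlopen(req).read()
>>> # '\x1f\x8b' is the gzip magic number -- decompress if something compressed the body anyway
>>> if val[:2] == '\x1f\x8b': val = zlib.decompress(val, 16 + zlib.MAX_WBITS)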
