>>> import urllib2
>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'
I think something else is causing your problem. That series of bytes looks like some encoded content.
import urllib2
bad_article = 'http://en.wikipedia.org/wiki/India'
req = urllib2.Request(bad_article)
req.add_header('User-Agent', 'Mozilla/5.0')
result = urllib2.urlopen(req)
print result.readline()
resulted in this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
which is correct.
Do a "curl -i" for both links. If the output comes back fine, there is no environment problem.
It's not an environment, locale, or encoding problem. The offending stream of bytes is gzip-compressed. The \x1f\x8B
at the start is what you get at the start of a gzip stream with the default settings.
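You can verify that claim with the standard library: every gzip stream produced with default settings starts with the magic bytes `\x1f\x8b`. A minimal sketch (the payload here is just sample data, not the Wikipedia response):

```python
import gzip
import io

# Compress a sample payload in memory and inspect the first two bytes.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b'hello')
data = buf.getvalue()

# Every gzip stream begins with the magic bytes \x1f\x8b.
print(repr(data[:2]))  # b'\x1f\x8b'
```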
Looks as though the server is ignoring the fact that you didn't do
req2.add_header('Accept-encoding', 'gzip')
You should look at result.headers.getheader('Content-Encoding')
and if necessary, decompress it yourself.