Why does this url raise BadStatusLine with httplib2 and urllib2?

Using httplib2 and urllib2, I'm trying to fetch pages from this url, but all of them didn't work out and ended up with this exception.

content = conn.request(uri="http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1129, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 901, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 871, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1027, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)

HTTP header was like this

http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902

GET /news/news_print.asp?artice_id=20110727092902 HTTP/1.1
Host: www.zdnet.co.kr
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ko-kr,ko;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: RMID=7d83495d4f336fe0; __utma=37206251.1552605885.1328771258.1328771258.1329070845.2; __utmz=37206251.1328771258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASPSESSIONIDCSQCQTDD=BCLEHPPDEPHEBJDLCFNDMKDN; __utmc=37206251; ASPSESSIONIDSSQCQQCB=MJPLMOJAFPDFCLONCANBIKHN; _EXEN=2
X-FireLogger: 1.2

HTTP/1.1 200 OK
Date: Mon, 13 Feb 2012 18:02:56 GMT
Content-Length: 19158
Content-Type: text/html;charset=UTF-8; Charset=UTF-8
Set-Cookie: ASPSESSIONIDSQSDQRDB=NGAIFHKAGDIOGEMANAOLLKKF; path=/
Cache-Control: private

Any clue?

This works fine for me:

import urllib2

opener = urllib2.build_opener()

headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}

opener.addheaders = headers.items()
response = opener.open("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")

print response.headers
print response.read()

The website discards all requests that occur without a User-Agent string.

For all the people that end up here with a similar problem after installing httplib2 0.8:

Version 0.8 has a regression with connection handling in relation with HTTP keep-alive. See the bug report: https://code.google.com/p/httplib2/issues/detail?id=250

There is a fix for this issue, but it has not been released so far. Until then just use httplib2 0.7.7.

In my code,when i use

    from urllib2 import urlopen  
    content = urlopen(page).read()

the exception appears. However, when i use

    import urllib  
    content = urllib.urlopen(page).read()

everything is ok. Maybe it will help u.

Look like this webpage doesn't allow your user agent. You can change it like this:

>>> import urllib2
>>> user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>>> headers = { 'User-Agent' : user_agent }
>>> r = urllib2.Request('http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902', headers=headers)
>>> fd = urllib2.urlopen(r)
>>> print fd[20:]
'<!DOCTYPE html PUBLI'

来源：https://stackoverflow.com/questions/9265616/why-does-this-url-raise-badstatusline-with-httplib2-and-urllib2

标签

python

urllib2

httplib2