urllib2 returns 404 for a website which displays fine in browsers

生来就可爱ヽ(ⅴ<●) 提交于 2020-01-10 01:23:30

问题


I am not able to open one particular url using urllib2. Same approach works well with other websites such as "http://www.google.com" but not this site (which also displays fine in the browser).

my simple code:

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.experts.scival.com/einstein/"
response=urllib2.urlopen(url)
html=response.read()
soup=BeautifulSoup(html)
print soup

Can anyone help me to make it work?

this is error I got:

Traceback (most recent call last):
  File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
    response=urllib2.urlopen(url);
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

Thank you


回答1:


I just tried this and received 404 code and page back.

At a guess it's doing User-Agent detection which either by accident or on purpose doesn't serve content to python urllib.

Clarification, with urllib, I received the urlopen returned a response object with a 404 code and HTML content. With urllib2.urlopen an urllib2.HTTPError exception was raised.

I'd suggest you try setting your User Agent to something that looks like a browser. There's a question about this here: Changing user agent on urllib2.urlopen




回答2:


You can use try except to capture an Error

try:
    u = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.code
    print e.msg
    return



回答3:


hm... are you sure that URL is valid? try "http://www.google.com" I had similar code and there is no problems with urllib. Or you can use try - except statement to see error's details. And of course MattH's answer is very similar to the truth :)



来源:https://stackoverflow.com/questions/12302304/urllib2-returns-404-for-a-website-which-displays-fine-in-browsers

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!