问题
I am not able to open one particular url using urllib2. Same approach works well with other websites such as "http://www.google.com" but not this site (which also displays fine in the browser).
my simple code:
from BeautifulSoup import BeautifulSoup
import urllib2
url="http://www.experts.scival.com/einstein/"
response=urllib2.urlopen(url)
html=response.read()
soup=BeautifulSoup(html)
print soup
Can anyone help me to make it work?
this is error I got:
Traceback (most recent call last):
File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
response=urllib2.urlopen(url);
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
result = self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
Thank you
回答1:
I just tried this and received 404 code and page back.
At a guess it's doing User-Agent detection which either by accident or on purpose doesn't serve content to python urllib.
Clarification, with urllib
, I received the urlopen
returned a response object with a 404 code and HTML content. With urllib2.urlopen
an urllib2.HTTPError
exception was raised.
I'd suggest you try setting your User Agent to something that looks like a browser. There's a question about this here: Changing user agent on urllib2.urlopen
回答2:
You can use try except
to capture an Error
try:
u = urllib2.urlopen(req)
except urllib2.HTTPError, e:
print e.code
print e.msg
return
回答3:
hm... are you sure that URL is valid? try "http://www.google.com" I had similar code and there is no problems with urllib. Or you can use try - except statement to see error's details. And of course MattH's answer is very similar to the truth :)
来源:https://stackoverflow.com/questions/12302304/urllib2-returns-404-for-a-website-which-displays-fine-in-browsers