Question
Earlier today I was able to pull data from Google Patents using the code below:
import urllib2
url = 'http://www.google.com/search?tbo=p&q=ininventor:"John-Mudd"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'
req = urllib2.Request(url, headers={'User-Agent' : "foobar"})
response = urllib2.urlopen(req)
Now when I run it I get the following 503 error. I had only looped through this code maybe 30 times (I'm trying to get all the patents owned by a list of 30 people).
HTTPError Traceback (most recent call last)
<ipython-input-4-01f83e2c218f> in <module>()
----> 1 response = urllib2.urlopen(req)
C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
124 if _opener is None:
125 _opener = build_opener()
--> 126 return _opener.open(url, data, timeout)
127
128 def install_opener(opener):
C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
404 for processor in self.process_response.get(protocol, []):
405 meth = getattr(processor, meth_name)
--> 406 response = meth(req, response)
407
408 return response
C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
517 if not (200 <= code < 300):
518 response = self.parent.error(
--> 519 'http', request, response, code, msg, hdrs)
520
521 return response
C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
436 http_err = 0
437 args = (dict, proto, meth_name) + args
--> 438 result = self._call_chain(*args)
439 if result:
440 return result
C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
376 func = getattr(handler, meth_name)
377
--> 378 result = func(*args)
379 if result is not None:
380 return result
C:\Python27\lib\urllib2.pyc in http_error_302(self, req, fp, code, msg, headers)
623 fp.close()
624
--> 625 return self.parent.open(new, timeout=req.timeout)
626
627 http_error_301 = http_error_303 = http_error_307 = http_error_302
C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
404 for processor in self.process_response.get(protocol, []):
405 meth = getattr(processor, meth_name)
--> 406 response = meth(req, response)
407
408 return response
C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
517 if not (200 <= code < 300):
518 response = self.parent.error(
--> 519 'http', request, response, code, msg, hdrs)
520
521 return response
C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
442 if http_err:
443 args = (dict, 'default', 'http_error_default') + orig_args
--> 444 return self._call_chain(*args)
445
446 # XXX probably also want an abstract factory that knows when it makes
C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
376 func = getattr(handler, meth_name)
377
--> 378 result = func(*args)
379 if result is not None:
380 return result
C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
525 class HTTPDefaultErrorHandler(BaseHandler):
526 def http_error_default(self, req, fp, code, msg, hdrs):
--> 527 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
528
529 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 503: Service Unavailable
Answer 1:
A shot in the dark:
Did you check whether the response included a "Retry-After" header? It's a real possibility with a 503.
From RFC 2616:
14.37 Retry-After
The Retry-After response-header field can be used with a 503 (Service Unavailable) response to indicate how long the service is expected to be unavailable to the requesting client. This field MAY also be used with any 3xx (Redirection) response to indicate the minimum time the user-agent is asked wait before issuing the redirected request. The value of this field can be either an HTTP-date or an integer number of seconds (in decimal) after the time of the response.
    Retry-After = "Retry-After" ":" ( HTTP-date | delta-seconds )
Two examples of its use are
    Retry-After: Fri, 31 Dec 1999 23:59:59 GMT
    Retry-After: 120
In the latter example, the delay is 2 minutes.
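To make that concrete, here is a minimal sketch of reading the header from the 503 and turning either form of its value into a delay. It is written for Python 3 (urllib.request and urllib.error) rather than the Python 2 urllib2 used in the question, and the names retry_after_seconds, fetch_with_retry, max_attempts and default_delay are made up for illustration. Whether Google actually sends a Retry-After header here is not something I have checked, so the code falls back to a fixed delay when it is absent.

```python
import re
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a delay in seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    if re.fullmatch(r"[0-9]+", value):
        # delta-seconds form, e.g. "120"
        return int(value)
    try:
        # HTTP-date form, e.g. "Fri, 31 Dec 1999 23:59:59 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0, int((when - now).total_seconds()))


def fetch_with_retry(url, max_attempts=3, default_delay=60):
    """Fetch url, sleeping as Retry-After asks whenever a 503 comes back."""
    req = urllib.request.Request(url, headers={"User-Agent": "foobar"})
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(req).read()
        except urllib.error.HTTPError as err:
            # Give up on anything other than 503, or on the last attempt.
            if err.code != 503 or attempt == max_attempts - 1:
                raise
            delay = retry_after_seconds(err.headers.get("Retry-After"))
            time.sleep(default_delay if delay is None else delay)
```

With the two example values from the RFC, retry_after_seconds("120") gives 120, and the HTTP-date form gives the number of seconds between now and that date (0 if the date has already passed).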
Answer 2:
Google's TOS bans automated queries, sadly enough. It almost certainly detected that you were "up to no good."
source: https://support.google.com/websearch/answer/86640?hl=en
Source: https://stackoverflow.com/questions/15506651/503-error-when-trying-to-access-google-patents-using-python