Question
There is a website I'm scraping that will sometimes return a 200, but have no text in response.body (which raises an AttributeError when I try to parse it with Selector).
Is there a simple way to check that the body includes text and, if not, retry the request until it does? Here is some pseudocode outlining what I'm trying to do.
def check_response(response):
    if response.body != '':
        return response
    else:
        return Request(copy_of_response.request,
                       callback=check_response)
Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?
Answer 1:
Follow the EAFP principle:
Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.
Handle the exception and yield a Request to the current URL with dont_filter=True:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
from scrapy import Request

def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        # empty body: retry the identical URL, bypassing the dupe filter
        yield Request(response.url, callback=self.parse, dont_filter=True)
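Note that because dont_filter bypasses the duplicate filter, an unconditionally re-yielded request can loop forever if the site keeps returning empty bodies. A minimal sketch of one way to cap the retries, assuming a hypothetical empty_body_retries counter carried in request.meta (the key name and the limit of 3 are my own):

def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        retries = response.meta.get('empty_body_retries', 0)
        if retries < 3:  # arbitrary cap; tune to taste
            yield Request(
                response.url,
                callback=self.parse,
                dont_filter=True,
                meta={'empty_body_retries': retries + 1},
            )
        else:
            self.logger.warning('Giving up on %s after %d empty bodies',
                                response.url, retries)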
You can also make a copy of the current request (not tested):
new_request = response.request.copy()
new_request.dont_filter = True
yield new_request
Or, make a new request using replace():
new_request = response.request.replace(dont_filter=True)
yield new_request
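Both approaches preserve the original request's method, URL, body, headers, cookies and meta. The difference is that copy() returns a clone you then mutate, while replace() builds the new request with the changed keyword arguments in one step and leaves the original untouched, which makes it the tidier option here.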
Answer 2:
How about calling the actual _retry() method from the retry middleware, so it acts like a normal retry, with all of its logic that takes your settings into account?
In settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}
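Setting the built-in entry to None disables the stock retry middleware, and 550 is the priority it normally occupies in DOWNLOADER_MIDDLEWARES_BASE, so the subclass slots into its place. The scraper.middlewares.retry path reflects the answerer's project layout; adjust it to wherever your middleware module lives.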
Then your retry middleware could look like this:
from scrapy.downloadermiddlewares.retry import RetryMiddleware as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # Inject the _retry method so a request can be retried from the
        # spider itself, on any condition, even for 200 responses.
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(
            request, response, spider)
And then, in your success-response callback, you can call, for example:
yield self._retry(response.request, ValueError, self)
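Put together, an empty-body check in the callback might look like the following minimal sketch; the body test and the ValueError reason are my own choices, and _retry returns None once RETRY_TIMES is exhausted, so yielding its result is safe either way:

def parse(self, response):
    if not response.body:
        # Delegate to the injected middleware method: it honours
        # RETRY_TIMES, records retry stats, and returns None when the
        # limit is reached (yielding None is a no-op in Scrapy).
        yield self._retry(response.request, ValueError('empty body'), self)
        return
    # normal parsing logic here
    ...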
Source: https://stackoverflow.com/questions/28640102/retrying-a-scrapy-request-even-when-receiving-a-200-status-code