Retrying a Scrapy Request even when receiving a 200 status code

Posted by 浪子不回头ぞ on 2020-01-03 09:24:54

Question


There is a website I'm scraping that will sometimes return a 200 but with no text in response.body (which raises an AttributeError when I try to parse it with Selector).

Is there a simple way to check to make sure the body includes text, and if not, retry the request until it does? Here is some pseudocode to outline what I'm trying to do.

def check_response(response):
    if response.body:  # response.body is bytes, so comparing it to '' is always True
        return response
    else:
        # somehow re-issue a copy of the original request and check it again
        return Request(copy_of_response.request,
                       callback=check_response)

Basically, is there a way I can repeat a request with the exact same properties (method, url, payload, cookies, etc.)?


Answer 1:


Follow the EAFP principle:

Easier to ask for forgiveness than permission. This common Python coding style assumes the existence of valid keys or attributes and catches exceptions if the assumption proves false. This clean and fast style is characterized by the presence of many try and except statements. The technique contrasts with the LBYL style common to many other languages such as C.

Handle an exception and yield a Request to the current url with dont_filter=True:

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

def parse(self, response):
    try:
        # parsing logic here
        ...
    except AttributeError:
        # empty body: repeat the same request, bypassing the duplicates filter
        yield Request(response.url, callback=self.parse, dont_filter=True)

You can also make a copy of the current request (not tested):

new_request = response.request.copy()
new_request.dont_filter = True
yield new_request

Or, make a new request using replace():

new_request = response.request.replace(dont_filter=True)
yield new_request
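
Because dont_filter=True bypasses the duplicates filter, a page whose body never comes back non-empty would be retried forever. Here is a minimal sketch (not from the original answer) that caps the retries with a hypothetical meta key, empty_body_retries; the spider name, URL, and retry limit are placeholders:

from scrapy import Spider
from scrapy.selector import Selector


class MySpider(Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    # assumption: give up after three extra attempts per URL
    max_empty_body_retries = 3

    def parse(self, response):
        try:
            sel = Selector(response)  # raises AttributeError on an empty body, per the question
        except AttributeError:
            retries = response.meta.get('empty_body_retries', 0)
            if retries < self.max_empty_body_retries:
                # identical request (method, url, payload, cookies, ...),
                # with the dupefilter bypassed and the counter bumped
                yield response.request.replace(
                    dont_filter=True,
                    meta={**response.meta, 'empty_body_retries': retries + 1},
                )
            return
        # normal parsing logic here, e.g.:
        yield {'url': response.url, 'title': sel.css('title::text').get()}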



Answer 2:


How about calling the actual _retry() method from the retry middleware, so it acts as a normal retry with all its logic that takes the settings into account?

In settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scraper.middlewares.retry.RetryMiddleware': 550,
}

Setting the built-in RetryMiddleware to None disables it, so the subclass below takes over at priority 550. Your retry middleware could then look like this:

from scrapy.downloadermiddlewares.retry import RetryMiddleware \
    as BaseRetryMiddleware


class RetryMiddleware(BaseRetryMiddleware):

    def process_response(self, request, response, spider):
        # inject the retry method so the spider itself can retry a
        # request on its own conditions, even for 200 responses
        if not hasattr(spider, '_retry'):
            spider._retry = self._retry
        return super(RetryMiddleware, self).process_response(request, response, spider)

Then, in your success-response callback, you can call it, for example:

yield self._retry(response.request, ValueError, self)
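
For completeness, here is a minimal sketch of the spider side (not from the original answer; the spider name and URL are placeholders, and it assumes the custom middleware above is enabled). Note that _retry() honours RETRY_TIMES and returns None once the retries are exhausted, and yielding None from a callback is simply ignored by Scrapy:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        if not response.body:
            # _retry was injected by the middleware above; it respects
            # RETRY_TIMES and returns None when retries are exhausted
            yield self._retry(response.request, ValueError('empty body'), self)
            return
        # normal parsing logic here
        yield {'url': response.url}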


Source: https://stackoverflow.com/questions/28640102/retrying-a-scrapy-request-even-when-receiving-a-200-status-code
