Scrapy CrawlSpider retry scrape

我的未来我决定 提交于 2019-12-30 11:24:08

问题


For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some javascript that autoreloads until it gets the real page. I can detect when this happens and I want to retry downloading and scraping the page. The logic that I use in my CrawlSpider is something like:

def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic

However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse which the CrawlSpider uses to apply the crawl rules and dont_filter=True, I could avoid the duplicate filter. However with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.

Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of doing dynamic js rendering using something like splash if possible, and this only happens intermittently.


回答1:


I would think about having a custom Retry Middleware instead - similar to a built-in one.

Sample implementation (not tested):

import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response

        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        retryreq = request.copy()
        retryreq.dont_filter = True
        return retryreq

And don't forget to activate it.



来源:https://stackoverflow.com/questions/32664022/scrapy-crawlspider-retry-scrape

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!