Scrapy CrawlSpider retry scrape

小鲜肉 2021-01-15 08:04

For a page that I'm trying to scrape, I sometimes get a "placeholder" page back in my response that contains some JavaScript that auto-reloads until it gets the real page.

1 Answer
  • 2021-01-15 08:11

    I would consider writing a custom retry middleware instead, similar to the built-in one.

    Sample implementation (not tested):

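    import logging

    logger = logging.getLogger(__name__)

    # Marker that only appears in the JavaScript placeholder page.
    PLACEHOLDER_MARKER = b'var PageIsLoaded = false;'


    class RetryMiddleware(object):
        def process_response(self, request, response, spider):
            # response.body is bytes, so it is compared against a bytes literal.
            if PLACEHOLDER_MARKER in response.body:
                logger.warning('parse_page encountered an incomplete rendering of %s', response.url)
                return self._retry(request) or response

            return response

        def _retry(self, request):
            logger.debug("Retrying %(request)s", {'request': request})

            # Returning a Request from process_response reschedules it;
            # dont_filter=True keeps the duplicate filter from dropping it.
            retryreq = request.copy()
            retryreq.dont_filter = True
            return retryreq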
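
    One thing the sample does not cover is a page that never stops returning the placeholder: the request would be rescheduled forever. A minimal sketch of a capped variant follows. The class name `PlaceholderRetryMiddleware`, the `max_retries` value, and the `placeholder_retries` meta key are assumptions made for illustration, not part of Scrapy. The counter travels in `request.meta`, which is roughly how the built-in retry middleware tracks its own attempts.

```python
import logging

logger = logging.getLogger(__name__)

# Marker from the sample above; only the placeholder page contains it.
PLACEHOLDER_MARKER = b'var PageIsLoaded = false;'


class PlaceholderRetryMiddleware:
    """Retry placeholder pages, giving up after max_retries attempts."""

    # Hypothetical cap and meta key, chosen for this sketch.
    max_retries = 3
    meta_key = 'placeholder_retries'

    def process_response(self, request, response, spider):
        if PLACEHOLDER_MARKER not in response.body:
            return response
        # Fall back to the placeholder response once retries run out.
        return self._retry(request) or response

    def _retry(self, request):
        retries = request.meta.get(self.meta_key, 0) + 1
        if retries > self.max_retries:
            logger.warning('Gave up on %s after %d retries',
                           request.url, self.max_retries)
            return None

        # Copy the request, record the attempt, and bypass the dupe filter.
        retryreq = request.copy()
        retryreq.meta[self.meta_key] = retries
        retryreq.dont_filter = True
        return retryreq
```

    With a cap in place, the `or response` fallback becomes meaningful: once the limit is hit, the spider callback receives the placeholder page and can decide what to do with it.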
    

    And don't forget to activate it.
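
    Activation goes through the `DOWNLOADER_MIDDLEWARES` setting. A sketch of what that could look like in `settings.py` is below; the module path `myproject.middlewares` is hypothetical, and the order value 550 is only a suggestion (it is the slot the built-in retry middleware occupies by default), so adjust both to your project.

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path; point it at wherever the class actually lives.
    'myproject.middlewares.RetryMiddleware': 550,
}
```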
