Question
For a page I'm trying to scrape, I sometimes get a "placeholder" page back in the response that contains some JavaScript that auto-reloads until it gets the real page. I can detect when this happens, and I want to retry downloading and scraping the page. The logic I use in my CrawlSpider is something like:
def parse_page(self, response):
    url = response.url

    # Check to make sure the page is loaded
    if 'var PageIsLoaded = false;' in response.body:
        self.logger.warning('parse_page encountered an incomplete rendering of {}'.format(url))
        yield Request(url, self.parse, dont_filter=True)
        return

    ...
    # Normal parsing logic
However, it seems like when the retry logic gets called and a new Request is issued, the pages and the links they contain don't get crawled or scraped. My thought was that by using self.parse, which the CrawlSpider uses to apply the crawl rules, and by setting dont_filter=True, I could avoid the duplicate filter. However, with DUPEFILTER_DEBUG = True, I can see that the retry requests get filtered away.
Am I missing something, or is there a better way to handle this? I'd like to avoid the complication of dynamic JS rendering with something like Splash if possible, since this only happens intermittently.
Answer 1:
I would think about using a custom retry downloader middleware instead, similar to the built-in RetryMiddleware.
Sample implementation (not tested):
import logging

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    """Downloader middleware that re-issues a request when the response
    is the placeholder page."""

    def process_response(self, request, response, spider):
        # Same placeholder check as in the spider, but done before the
        # response ever reaches parse_page
        if 'var PageIsLoaded = false;' in response.body:
            logger.warning('parse_page encountered an incomplete rendering of {}'.format(response.url))
            return self._retry(request) or response
        return response

    def _retry(self, request):
        logger.debug("Retrying %(request)s", {'request': request})

        retryreq = request.copy()
        retryreq.dont_filter = True

        return retryreq
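Note that this sketch retries every time the placeholder is seen, so a page that never finishes loading would loop forever. If that is a concern, you could cap the attempts the way the built-in RetryMiddleware does, by keeping a counter in request.meta. A minimal sketch of such a variant of _retry (the meta key placeholder_retry_times and the limit of 3 are arbitrary assumptions):

    def _retry(self, request, max_retries=3):
        # Count attempts in request.meta so a permanently broken page is eventually dropped
        retries = request.meta.get('placeholder_retry_times', 0) + 1
        if retries > max_retries:
            logger.debug("Gave up retrying %(request)s", {'request': request})
            # process_response falls back to returning the placeholder response
            return None

        logger.debug("Retrying %(request)s", {'request': request})
        retryreq = request.copy()
        retryreq.meta['placeholder_retry_times'] = retries
        retryreq.dont_filter = True
        return retryreq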
And don't forget to activate it in the DOWNLOADER_MIDDLEWARES setting.
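A minimal sketch of that activation in settings.py (the dotted path myproject.middlewares.RetryMiddleware and the priority 550 are assumptions; point them at wherever the class lives in your project):

    # settings.py -- register the custom middleware in the downloader middleware chain
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RetryMiddleware': 550,
    }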
Source: https://stackoverflow.com/questions/32664022/scrapy-crawlspider-retry-scrape