Scrapy retry or redirect middleware

执念已碎 2021-02-03 14:27

While crawling a site with Scrapy, I get redirected to a user-blocked page about 1/5th of the time. I lose the pages that I get redirected from when that happens.

2 Answers
  •  醉话见心
    2021-02-03 15:04

    I had the same problem today with a website that used 301..303 redirects, but also sometimes a meta refresh redirect. I built a retry middleware and reused some chunks from the redirect middlewares:

    import logging
    
    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import get_meta_refresh
    
    logger = logging.getLogger(__name__)
    
    
    class CustomRetryMiddleware(RetryMiddleware):
    
        def process_response(self, request, response, spider):
            url = response.url
            # Retry the original request instead of following an HTTP redirect.
            if response.status in [301, 307]:
                logger.info("trying to redirect us: %s", url)
                reason = 'redirect %d' % response.status
                return self._retry(request, reason, spider) or response
            # Retry on a <meta http-equiv="refresh"> redirect as well.
            interval, redirect_url = get_meta_refresh(response)
            if redirect_url:
                logger.info("trying to redirect us: %s", url)
                reason = 'meta'
                return self._retry(request, reason, spider) or response
            # Retry if the page is a captcha challenge.
            captcha = response.xpath(".//input[contains(@id, 'captchacharacters')]").extract()
            if captcha:
                logger.info("captcha page %s", url)
                reason = 'captcha'
                return self._retry(request, reason, spider) or response
            return response
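
The three checks in the middleware above (redirect status, meta refresh, captcha input) can be tried out without running a crawl. The following is only a rough, dependency-free sketch of that decision logic: `retry_reason` and the two regular expressions are hypothetical names introduced here, and the regexes are crude stand-ins for `get_meta_refresh` and the XPath query, not equivalents of them.

```python
import re

# Hypothetical approximations of the checks done in the middleware.
# Real pages may need the proper Scrapy helpers instead of regexes.
_META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*>""", re.IGNORECASE
)
_CAPTCHA_INPUT = re.compile(
    r"""<input[^>]+id\s*=\s*["'][^"']*captchacharacters[^"']*["']""", re.IGNORECASE
)


def retry_reason(status, body):
    """Return the retry reason for a response, or None if it looks fine."""
    # Same status codes the middleware treats as unwanted redirects.
    if status in (301, 307):
        return "redirect %d" % status
    # A <meta http-equiv="refresh"> tag signals a meta redirect.
    if _META_REFRESH.search(body):
        return "meta"
    # An <input> whose id contains 'captchacharacters' signals a captcha page.
    if _CAPTCHA_INPUT.search(body):
        return "captcha"
    return None
```

A helper like this makes it easy to check sample pages saved from the target site before wiring the same conditions into `process_response`.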
    

    To use this middleware, it's probably best to disable the existing redirect middlewares for this project in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
    }
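
Because `CustomRetryMiddleware` subclasses `RetryMiddleware`, its `_retry` calls are bounded by Scrapy's retry settings: once a request has used up `RETRY_TIMES` retries (default 2), `_retry` returns `None` and the `or response` fallback passes the last response through. If the site blocks roughly one request in five, a slightly higher limit may help. The value below is an assumption chosen for illustration, not a recommendation from the answer.

```python
# settings.py (sketch): retry limits consulted by RetryMiddleware._retry

# Must stay enabled, otherwise RetryMiddleware (and subclasses) is not loaded.
RETRY_ENABLED = True

# Maximum number of retries per request, on top of the first attempt.
# 5 is an arbitrary example value; Scrapy's default is 2.
RETRY_TIMES = 5
```

A single request can override the limit through the `max_retry_times` key in `Request.meta`.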
    
