Scrapy retry or redirect middleware

执念已碎 2021-02-03 14:27

While crawling a site with Scrapy, I get redirected to a user-blocked page about one time in five, and I lose the pages that I get redirected from when that happens.

2 Answers
  • 2021-02-03 14:56

    You can handle 302 responses yourself by adding handle_httpstatus_list = [302] at the top of your spider, like so:

    class MySpider(CrawlSpider):
        handle_httpstatus_list = [302]
    
        def parse(self, response):
            if response.status == 302:
                # Store response.url somewhere and go back to it later
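
    A minimal sketch of the "store it and go back to it later" idea above, written for current Scrapy. Re-queueing the original request with `replace(dont_filter=True)` is my own suggestion, not part of the answer; `dont_filter` is needed because the scheduler's duplicate filter would otherwise drop the repeated request:

    ```python
    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"
        # Let 302 responses reach parse() instead of being handled
        # by the redirect middleware.
        handle_httpstatus_list = [302]

        def parse(self, response):
            if response.status == 302:
                # Re-queue the original request; dont_filter=True bypasses
                # the duplicate filter so the retry is not silently dropped.
                yield response.request.replace(dont_filter=True)
            # ...normal parsing of good responses goes here
    ```

    Note that without some backoff or retry cap this can loop forever on a page that always blocks you, so in practice you would also track how often each URL has been retried.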
    
  • 2021-02-03 15:04

    I had the same problem today with a website that used 301, 302 and 303 redirects, but also sometimes a meta refresh redirect. I built a retry middleware and reused some chunks from the redirect middlewares:

    from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
    from scrapy.selector import HtmlXPathSelector
    from scrapy.utils.response import get_meta_refresh
    from scrapy import log
    
    class CustomRetryMiddleware(RetryMiddleware):
    
        def process_response(self, request, response, spider):
            url = response.url
            if response.status in [301, 307]:
                log.msg("trying to redirect us: %s" % url, level=log.INFO)
                reason = 'redirect %d' % response.status
                return self._retry(request, reason, spider) or response
            interval, redirect_url = get_meta_refresh(response)
            # handle meta redirect
            if redirect_url:
                log.msg("trying to redirect us: %s" % url, level=log.INFO)
                reason = 'meta'
                return self._retry(request, reason, spider) or response
            hxs = HtmlXPathSelector(response)
            # test for captcha page
            captcha = hxs.select(".//input[contains(@id, 'captchacharacters')]").extract()
            if captcha:
                log.msg("captcha page %s" % url, level=log.INFO)
                reason = 'captcha'
                return self._retry(request, reason, spider) or response
            return response
    

    In order to use this middleware, it's probably best to disable the existing redirect middlewares for this project in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': None,
    }
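
    On recent Scrapy versions the `scrapy.contrib` module paths above have been removed; the same wiring uses the `scrapy.downloadermiddlewares` package instead (the `YOUR_PROJECT` path is illustrative, as in the answer):

    ```python
    # settings.py -- equivalent configuration for modern Scrapy
    DOWNLOADER_MIDDLEWARES = {
        'YOUR_PROJECT.scraper.middlewares.CustomRetryMiddleware': 120,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
    }
    ```

    The RetryMiddleware base class is likewise imported from scrapy.downloadermiddlewares.retry on current versions.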
    