Question
I'm using Scrapy-splash and I have a problem with memory. I can clearly see that the memory used by the docker python3 process is gradually increasing until the PC freezes.

I can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way 3 HTML pages consume 10GB of RAM.

There is a workaround: set maxrss to some reasonable value. When RAM usage reaches this value, docker is restarted and the RAM is freed. But the problem is that while docker is down, scrapy keeps sending requests, so a couple of URLs end up not scraped. The retry middleware tries to retry those requests right away and then gives up:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.ex.com/eiB3t/ via http://127.0.0.1:8050/execute> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-03-30 14:28:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.ex.com/eiB3t/
So I have two questions:
- Do you know a better solution?
- If not, how can I set Scrapy to retry the request after some time (let's say one minute, so docker has time to restart)?
Answer 1:
Method 1
One way would be to add a middleware to your Spider (source):
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Returning a Deferred pauses this request; reactor.callLater fires the
        # Deferred (and lets the request continue) after delay_s seconds.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
Which you could later use in your Spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
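To connect this to the retry scenario in the question, one possibility (a hedged sketch, not part of the original answer) is to re-issue a request that ultimately failed through the same middleware, giving docker time to come back. The errback name and the 60-second delay below are assumptions:

def start_requests(self):
    yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                         errback=self.handle_error)

def handle_error(self, failure):
    # Called once the built-in retries are exhausted; re-schedule the same
    # request and ask DelayedRequestsMiddleware to hold it for ~1 minute.
    retry_request = failure.request.replace(dont_filter=True)
    retry_request.meta['delay_request_by'] = 60  # assumed restart window
    return retry_request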
Method 2
You could do this with a custom retry middleware (source); you just need to override the process_response method of the current RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10) or polling the server until it is alive
            return self._retry(request, reason, spider) or response
        return response
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
Answer 2:
A more elaborate solution could be to set up a Kubernetes cluster in which you have multiple replicas running. This way you avoid having a failure of just 1 container impacting your scraping job.
I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (but this will affect the delay between all requests), or set RETRY_TIMES to a higher value than the default of 2.
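For reference, both of those are plain entries in settings.py; the values below are arbitrary examples, not recommendations:

# settings.py
DOWNLOAD_DELAY = 2   # seconds between consecutive requests (applies to every request)
RETRY_TIMES = 5      # retry failed requests up to 5 times instead of the default 2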
Source: https://stackoverflow.com/questions/55431996/scrapy-set-delay-to-retry-middleware