Question
I'm using Scrapy-splash and I have a problem with memory. I can clearly see that the memory used by the docker python3 process is gradually increasing until the PC freezes.

I can't figure out why it behaves this way, because I have CONCURRENT_REQUESTS=3 and there is no way 3 HTML pages consume 10GB of RAM.

There is a workaround: set maxrss to some reasonable value. When RAM usage reaches this value, docker is restarted and the RAM is freed. But the problem is that while docker is down, scrapy keeps sending requests, so a couple of URLs end up not scraped. The retry middleware tries to retry those requests right away and then gives up:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.ex.com/eiB3t/ via http://127.0.0.1:8050/execute> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>]
2019-03-30 14:28:33 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.ex.com/eiB3t/
So I have two questions:
- Do you know a better solution?
- If not, how can I set Scrapy to retry the request after some time (let's say one minute, so docker has time to restart)?
Answer 1:
Method 1
One way would be to add a middleware to your Spider (source):
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Returning a Deferred pauses this request; reactor.callLater fires the
        # Deferred (and lets the request continue) after delay_s seconds.
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
Which you could later use in your Spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
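To connect this to the retry scenario in the question, one possibility (a hedged sketch, not part of the original answer) is to re-issue a request that ultimately failed through the same middleware, giving docker time to come back. The errback name and the 60-second delay below are assumptions:

def start_requests(self):
    yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                         errback=self.handle_error)

def handle_error(self, failure):
    # Called once the built-in retries are exhausted; re-schedule the same
    # request and ask DelayedRequestsMiddleware to hold it for ~1 minute.
    retry_request = failure.request.replace(dont_filter=True)
    retry_request.meta['delay_request_by'] = 60  # assumed restart window
    return retry_request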
Method 2
You could do this with a custom retry middleware (source); you just need to override the process_response method of the current RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10) or polling the server until it is alive
            return self._retry(request, reason, spider) or response
        return response
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
Answer 2:
A more elaborate solution could be to set up a Kubernetes cluster in which you have multiple replicas running. This way you avoid having a failure of just 1 container impacting your scraping job.
I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (but this will affect the delay between all requests), or set RETRY_TIMES to a higher value than the default of 2.
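For reference, both of those are plain entries in settings.py; the values below are arbitrary examples, not recommendations:

# settings.py
DOWNLOAD_DELAY = 2   # seconds between consecutive requests (applies to every request)
RETRY_TIMES = 5      # retry failed requests up to 5 times instead of the default 2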
Source: https://stackoverflow.com/questions/55431996/scrapy-set-delay-to-retry-middleware