Question
So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes it queues upwards of 100,000 requests), which I can see when I check the telnet console with >>> prefs()
I have been over the docs and Google again and again and I can't find a way to limit the number of requests the spider takes in. What I want is to be able to tell it to hold back on taking new requests once a certain number are in the scheduler. I have tried setting a DEPTH_LIMIT, but that only lets it grab a large amount of links and then run the callback on the ones it has already grabbed.
It seems like a fairly straightforward thing to do and I am sure people have run into this problem before, so I know there must be a way to get it done. Any ideas?
EDIT: Here are the crawler stats from a run with MEMUSAGE_ENABLED = True:
{'downloader/request_bytes': 105716,
'downloader/request_count': 315,
'downloader/request_method_count/GET': 315,
'downloader/response_bytes': 10066538,
'downloader/response_count': 315,
'downloader/response_status_count/200': 313,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'dupefilter/filtered': 32444,
'finish_reason': 'memusage_exceeded',
'finish_time': datetime.datetime(2015, 1, 14, 14, 2, 38, 134402),
'item_scraped_count': 312,
'log_count/DEBUG': 946,
'log_count/ERROR': 2,
'log_count/INFO': 9,
'memdebug/gc_garbage_count': 0,
'memdebug/live_refs/EnglishWikiSpider': 1,
'memdebug/live_refs/Request': 70194,
'memusage/limit_notified': 1,
'memusage/limit_reached': 1,
'memusage/max': 422600704,
'memusage/startup': 34791424,
'offsite/domains': 316,
'offsite/filtered': 18172,
'request_depth_max': 3,
'response_received_count': 313,
'scheduler/dequeued': 315,
'scheduler/dequeued/memory': 315,
'scheduler/enqueued': 70508,
'scheduler/enqueued/memory': 70508,
'start_time': datetime.datetime(2015, 1, 14, 14, 1, 31, 988254)}
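For context, the memory-related settings behind the memdebug/* and memusage/* entries above look roughly like this; this is only a sketch, and the limit value is a placeholder rather than the exact number I used:

# settings.py -- sketch of the settings relevant to the stats above
DEPTH_LIMIT = 3            # what I tried first; it limits crawl depth,
                           # not how many requests sit in the scheduler
MEMUSAGE_ENABLED = True    # memory usage extension; with a limit set it
                           # stops the crawl with 'memusage_exceeded'
MEMUSAGE_LIMIT_MB = 512    # placeholder cap, not necessarily my value
MEMDEBUG_ENABLED = True    # adds the memdebug/* entries (gc garbage
                           # count, live_refs) to the stats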
Answer 1:
I solved my problem. The answer was really hard to track down, so I am posting it here in case anyone else comes across the same issue.
After sifting through the Scrapy code and referring back to the docs, I could see that Scrapy keeps all pending requests in memory, which I had already deduced, but the scheduler (scrapy.core.scheduler) also checks whether a job directory is configured, in which case it writes pending requests to disk.
So, if you run the spider with a job directory, it writes pending requests to disk and retrieves them from there later, instead of storing them all in memory:
$ scrapy crawl spider -s JOBDIR=somedirname
When I do this and enter the telnet console, I can see that the number of requests in memory stays at about 25, while 100,000+ are written to disk, which is exactly how I wanted it to run.
It seems like this would be a common problem, given that a crawl of a large site will have many extractable links on every page. I am surprised it is not better documented or easier to find.
The Scrapy documentation at http://doc.scrapy.org/en/latest/topics/jobs.html states that the main purpose of this feature is pausing a crawl and resuming it later, but it works this way as well.
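A side note in case it helps someone: -s just overrides a setting from the command line, so the same thing can be put in settings.py instead; the directory name below is only a placeholder:

# settings.py -- equivalent to passing -s JOBDIR=... on the command line.
# The directory is created if it does not exist; it holds the on-disk
# request queues and the dupefilter state, so the crawl can also be
# paused (a single Ctrl-C) and resumed by starting it again with the
# same JOBDIR.
JOBDIR = 'crawls/somedirname-1'

Per that docs page, a job directory must not be shared between different spiders, or between different runs of the same spider, so use a distinct directory per crawl.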
Source: https://stackoverflow.com/questions/27943970/how-to-limit-scrapy-request-objects