Question
So I have a spider that I thought was leaking memory; it turns out it is just grabbing too many links from link-rich pages (sometimes it queues upwards of 100,000 requests), which I can see when I check the telnet console with >>> prefs()
I have been over the docs and Google again and again and I can't find a way to limit the number of requests the spider takes in. What I want is to be able to tell it to hold back on taking new requests once a certain number are in the scheduler. I have tried setting a DEPTH_LIMIT, but that only lets it grab a large amount of links and then run the callback on the ones it has already grabbed.
It seems like a fairly straightforward thing to do and I am sure people have run into this problem before, so I know there must be a way to get it done. Any ideas?
EDIT: Here are the crawler stats from a run with MEMUSAGE_ENABLED = True:
{'downloader/request_bytes': 105716,
'downloader/request_count': 315,
'downloader/request_method_count/GET': 315,
'downloader/response_bytes': 10066538,
'downloader/response_count': 315,
'downloader/response_status_count/200': 313,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'dupefilter/filtered': 32444,
'finish_reason': 'memusage_exceeded',
'finish_time': datetime.datetime(2015, 1, 14, 14, 2, 38, 134402),
'item_scraped_count': 312,
'log_count/DEBUG': 946,
'log_count/ERROR': 2,
'log_count/INFO': 9,
'memdebug/gc_garbage_count': 0,
'memdebug/live_refs/EnglishWikiSpider': 1,
'memdebug/live_refs/Request': 70194,
'memusage/limit_notified': 1,
'memusage/limit_reached': 1,
'memusage/max': 422600704,
'memusage/startup': 34791424,
'offsite/domains': 316,
'offsite/filtered': 18172,
'request_depth_max': 3,
'response_received_count': 313,
'scheduler/dequeued': 315,
'scheduler/dequeued/memory': 315,
'scheduler/enqueued': 70508,
'scheduler/enqueued/memory': 70508,
'start_time': datetime.datetime(2015, 1, 14, 14, 1, 31, 988254)}
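For context, the memory-related settings behind the memdebug/* and memusage/* entries above look roughly like this; this is only a sketch, and the limit value is a placeholder rather than the exact number I used:

# settings.py -- sketch of the settings relevant to the stats above
DEPTH_LIMIT = 3            # what I tried first; it limits crawl depth,
                           # not how many requests sit in the scheduler
MEMUSAGE_ENABLED = True    # memory usage extension; with a limit set it
                           # stops the crawl with 'memusage_exceeded'
MEMUSAGE_LIMIT_MB = 512    # placeholder cap, not necessarily my value
MEMDEBUG_ENABLED = True    # adds the memdebug/* entries (gc garbage
                           # count, live_refs) to the stats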
Answer 1:
I solved my problem. The answer was really hard to track down, so I am posting it here in case anyone else comes across the same issue.
After sifting through the Scrapy code and referring back to the docs, I could see that Scrapy keeps all pending requests in memory, which I had already deduced, but the scheduler (scrapy.core.scheduler) also checks whether a job directory is configured, in which case it writes pending requests to disk.
So, if you run the spider with a job directory, it writes pending requests to disk and retrieves them from there later, instead of storing them all in memory:
$ scrapy crawl spider -s JOBDIR=somedirname
When I do this and enter the telnet console, I can see that the number of requests in memory stays at about 25, while 100,000+ are written to disk, which is exactly how I wanted it to run.
It seems like this would be a common problem, given that a crawl of a large site will have many extractable links on every page. I am surprised it is not better documented or easier to find.
The Scrapy documentation at http://doc.scrapy.org/en/latest/topics/jobs.html states that the main purpose of this feature is pausing a crawl and resuming it later, but it works this way as well.
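A side note in case it helps someone: -s just overrides a setting from the command line, so the same thing can be put in settings.py instead; the directory name below is only a placeholder:

# settings.py -- equivalent to passing -s JOBDIR=... on the command line.
# The directory is created if it does not exist; it holds the on-disk
# request queues and the dupefilter state, so the crawl can also be
# paused (a single Ctrl-C) and resumed by starting it again with the
# same JOBDIR.
JOBDIR = 'crawls/somedirname-1'

Per that docs page, a job directory must not be shared between different spiders, or between different runs of the same spider, so use a distinct directory per crawl.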
Source: https://stackoverflow.com/questions/27943970/how-to-limit-scrapy-request-objects