Running dozens of Scrapy spiders in a controlled manner

北恋 2021-02-10 02:01

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (

3 Answers
  •  情话喂你    2021-02-10 02:11

    One solution, if the information is relatively static (based on your mention of the process "finishing"), is to simply set up a script that runs the crawls sequentially or in batches: wait for one to finish before starting the next one (or the next 10, or whatever your batch size is), along the lines of the sketch below.
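
    A minimal sketch of the sequential version, assuming a standard Scrapy project layout (the "myproject" module and the spider class names are hypothetical placeholders). It follows Scrapy's documented CrawlerRunner pattern, where each crawl only starts once the previous one has finished:

```python
# Minimal sketch: run spiders one at a time in a single process.
# "myproject" and the spider class names are hypothetical placeholders.
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

SPIDERS = [SpiderOne, SpiderTwo]  # extend with the rest of your spiders

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the current crawl to finish before the next starts.
    for spider_cls in SPIDERS:
        yield runner.crawl(spider_cls)
    reactor.stop()

crawl_sequentially()
reactor.run()  # blocks until every spider has run
```

    For batches, you could chunk SPIDERS and yield a twisted.internet.defer.DeferredList of runner.crawl() calls per chunk instead of one crawl at a time.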

    Another thing to consider if you're only using one machine and this error is cropping up: having too many files open isn't a real resource bottleneck, just an OS file-descriptor limit. You might be better off giving each spider around 200 concurrent requests so that network I/O (typically, though sometimes CPU or something else) becomes the bottleneck. On average, each spider will finish faster than with your current approach, which launches them all at once and hits the "maximum file descriptor" limit rather than an actual resource limit. A settings sketch follows below.
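
    Scrapy is event-driven rather than thread-per-request, so the practical equivalent of "200 or so threads" is raising its concurrency settings. A settings.py sketch with assumed numbers (tune them to your hardware and to what the target sites will tolerate):

```python
# settings.py sketch; the values here are assumptions, not recommendations.
# The goal is to make network I/O, rather than the OS file-descriptor
# limit, the thing that actually throttles each spider.
CONCURRENT_REQUESTS = 200             # in-flight requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # stay polite to any single site
REACTOR_THREADPOOL_MAXSIZE = 20       # DNS lookups run in this thread pool
DOWNLOAD_TIMEOUT = 30                 # keep slow responses from hogging slots
AUTOTHROTTLE_ENABLED = True           # back off automatically when sites slow down
```

    If you still hit the descriptor limit, raising it with `ulimit -n` is a separate, OS-level fix; the settings above mainly reduce how many spiders you need running at once.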
