Running dozens of Scrapy spiders in a controlled manner

北恋 2021-02-10 02:01

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (

3 Answers
  •  情话喂你    2021-02-10 02:11

    One solution, if the information is relatively static (based on your mention of the process "finishing"), is to simply set up a script that runs the crawls sequentially or in batches: wait for one to finish before starting the next one (or the next 10, or whatever your batch size is), along the lines of the sketch below.
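
    A minimal sketch of the sequential version, assuming a standard Scrapy project layout (the "myproject" module and the spider class names are hypothetical placeholders). It follows Scrapy's documented CrawlerRunner pattern, where each crawl only starts once the previous one has finished:

```python
# Minimal sketch: run spiders one at a time in a single process.
# "myproject" and the spider class names are hypothetical placeholders.
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

SPIDERS = [SpiderOne, SpiderTwo]  # extend with the rest of your spiders

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_sequentially():
    # Each yield waits for the current crawl to finish before the next starts.
    for spider_cls in SPIDERS:
        yield runner.crawl(spider_cls)
    reactor.stop()

crawl_sequentially()
reactor.run()  # blocks until every spider has run
```

    For batches, you could chunk SPIDERS and yield a twisted.internet.defer.DeferredList of runner.crawl() calls per chunk instead of one crawl at a time.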

    Another thing to consider if you're only using one machine and this error is cropping up: having too many files open isn't a real resource bottleneck, just an OS file-descriptor limit. You might be better off giving each spider around 200 concurrent requests so that network I/O (typically, though sometimes CPU or something else) becomes the bottleneck. On average, each spider will finish faster than with your current approach, which launches them all at once and hits the "maximum file descriptor" limit rather than an actual resource limit. A settings sketch follows below.
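
    Scrapy is event-driven rather than thread-per-request, so the practical equivalent of "200 or so threads" is raising its concurrency settings. A settings.py sketch with assumed numbers (tune them to your hardware and to what the target sites will tolerate):

```python
# settings.py sketch; the values here are assumptions, not recommendations.
# The goal is to make network I/O, rather than the OS file-descriptor
# limit, the thing that actually throttles each spider.
CONCURRENT_REQUESTS = 200             # in-flight requests per spider process
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # stay polite to any single site
REACTOR_THREADPOOL_MAXSIZE = 20       # DNS lookups run in this thread pool
DOWNLOAD_TIMEOUT = 30                 # keep slow responses from hogging slots
AUTOTHROTTLE_ENABLED = True           # back off automatically when sites slow down
```

    If you still hit the descriptor limit, raising it with `ulimit -n` is a separate, OS-level fix; the settings above mainly reduce how many spiders you need running at once.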
