I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow, but my current approach eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it.
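A minimal sketch of the kind of single-process setup I mean, with placeholder bucket, settings and notification (the real spiders and AWS configuration are omitted):

```python
# Sketch only: run every spider in the project from one process and push the
# results to S3 via Scrapy's built-in S3 feed storage (needs botocore or boto3).
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set("FEEDS", {
    # bucket name is a placeholder; %(name)s / %(time)s are expanded per feed
    "s3://my-results-bucket/%(name)s/%(time)s.json": {"format": "json"},
})
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY must be set here or in settings.py.

process = CrawlerProcess(settings)
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)          # every spider shares this one process
process.start()                         # blocks until all crawls finish
print("All spiders finished")           # the "let me know" step would go here
```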
That's probably a sign that you need multiple machines to execute your spiders; it's a scalability issue. You could also scale vertically and make your single machine more powerful, but you would hit a limit much sooner that way (the file-descriptor ceiling in the sketch below is one example).
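As a stopgap on a single Unix machine, you can at least check how much headroom the process has and raise its soft file-descriptor limit up to the hard limit; beyond that you need root or sysctl changes, which is exactly the kind of wall vertical scaling runs into:

```python
# Stopgap, not a fix: inspect and raise this process's file-descriptor limit.
# Unix-only (the resource module is not available on Windows).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file descriptors: soft={soft}, hard={hard}")
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))  # soft can't exceed hard
```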
Check out the Distributed Crawling documentation and the scrapyd project.
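To make the scrapyd suggestion concrete, here is a rough sketch that schedules spiders through scrapyd's HTTP JSON API and polls until every job has finished; the host, project name and spider names are placeholders, and it assumes the project has already been deployed to a running scrapyd daemon:

```python
# Sketch: schedule spiders on a scrapyd daemon and wait for them to finish.
# Assumes scrapyd is running on localhost:6800 and the project is deployed
# as "myproject". Uses the third-party requests library.
import time
import requests

SCRAPYD = "http://localhost:6800"
PROJECT = "myproject"
SPIDERS = ["spider_a", "spider_b"]  # placeholder spider names

for name in SPIDERS:
    resp = requests.post(f"{SCRAPYD}/schedule.json",
                         data={"project": PROJECT, "spider": name})
    resp.raise_for_status()

while True:
    jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                        params={"project": PROJECT}).json()
    if not jobs["pending"] and not jobs["running"]:
        break
    time.sleep(30)

print("All scrapyd jobs finished")  # hook your notification (email, SNS, ...) here
```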
There is also a cloud-based distributed crawling service called ScrapingHub, which would take the scalability problem off your hands altogether (note that I am not advertising them; I have no affiliation with the company).