Running dozens of Scrapy spiders in a controlled manner

北恋 2021-02-10 02:01

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (…)

3 Answers
  • 2021-02-10 02:11

    One solution, if the information is relatively static (based on your mention of the process "finishing"), is simply to set up a script that runs the crawls sequentially or in batches: wait for one batch to finish before starting the next (one spider, ten, or whatever batch size fits your machine). A minimal sketch of this follows at the end of this answer.

    Another thing to consider if you're running everything on one machine and this error keeps cropping up: having too many files open isn't a real resource bottleneck. You might be better off giving each spider 200 or so concurrent requests so that network I/O (typically, though sometimes CPU) becomes the limiting factor. Each spider will then finish faster on average than with your current approach, which launches them all at once and trips a "maximum file descriptors" limit rather than an actual resource limit.
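
    A rough sketch of that batching idea, assuming the spiders live in one Scrapy project and can be launched with scrapy crawl; the batch size and the CONCURRENT_REQUESTS value are illustrative, not recommendations:

    # run_in_batches.py: launch spiders a few at a time instead of all at
    # once, so the machine never needs thousands of open sockets.
    import subprocess

    # "scrapy list" prints one spider name per line.
    spiders = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout.split()

    BATCH_SIZE = 10  # tune to whatever your machine handles comfortably

    for i in range(0, len(spiders), BATCH_SIZE):
        batch = spiders[i:i + BATCH_SIZE]
        # Give each spider plenty of concurrency so network I/O, not the
        # file-descriptor limit, becomes the bottleneck.
        procs = [
            subprocess.Popen(
                ["scrapy", "crawl", name, "-s", "CONCURRENT_REQUESTS=200"]
            )
            for name in batch
        ]
        for p in procs:
            p.wait()  # wait for the whole batch before starting the next one

    print("all crawls finished")  # hook your notification in here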

  • 2021-02-10 02:13

    The simplest way to do this is to run them all from the command line. For example:

    $ scrapy list | xargs -P 4 -n 1 scrapy crawl
    

    This will run all your spiders, with up to 4 running in parallel at any time. You can then send a notification from a wrapper script once the command has completed.
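
    For example, a small wrapper (a sketch only: the SNS topic ARN is a placeholder, and boto3 is just one option since the results are already going to S3; email, Slack, or anything else works the same way):

    # notify_when_done.py: run the crawl pipeline, then send a notification.
    import subprocess

    import boto3  # assumes AWS credentials are already configured

    # Same command as above; shell=True so the pipe works.
    result = subprocess.run("scrapy list | xargs -P 4 -n 1 scrapy crawl", shell=True)

    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:crawl-status",  # placeholder
        Subject="Scrapy crawls finished",
        Message=f"xargs exited with code {result.returncode}",
    )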

    A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
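
    A rough sketch of driving scrapyd over its HTTP API, using its standard listspiders.json, schedule.json and listjobs.json endpoints; the host, project name and polling interval are assumptions:

    # schedule_and_wait.py: queue every spider on a scrapyd instance, then
    # poll until nothing is pending or running.
    import time

    import requests

    SCRAPYD = "http://localhost:6800"  # assumed scrapyd host
    PROJECT = "myproject"              # assumed project name

    spiders = requests.get(
        f"{SCRAPYD}/listspiders.json", params={"project": PROJECT}
    ).json()["spiders"]

    for name in spiders:
        requests.post(f"{SCRAPYD}/schedule.json", data={"project": PROJECT, "spider": name})

    while True:
        jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": PROJECT}).json()
        if not jobs["pending"] and not jobs["running"]:
            break
        time.sleep(30)

    print("all jobs finished")  # send your notification here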

    Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It lets you run only a certain number of jobs at once, gives you a queue of pending jobs (which you can modify, browse online, prioritize, etc.), and has a more complete API than scrapyd.

    You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough of them to max out your hardware (usually CPU or network). If you still run into file-descriptor problems at that point, increase the limit.
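
    If you do end up raising the file-descriptor limit, note that a process can only lift its own soft limit up to the hard limit; going beyond that needs an OS-level change (ulimit, limits.conf, etc.). A minimal sketch using Python's resource module:

    # raise_nofile_limit.py: lift this process's open-file soft limit (Unix only).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"current limits: soft={soft}, hard={hard}")

    # The soft limit can be raised up to, but not beyond, the hard limit.
    # Child processes (e.g. spawned crawls) inherit the new value.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))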

  • 2021-02-10 02:20

    "it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it"

    That's probably a sign that you need multiple machines to execute your spiders; it's a scalability issue. You can also scale vertically and make your single machine more powerful, but you would hit that kind of "limit" much sooner:

    • Difference between scaling horizontally and vertically for databases

    Check out the Distributed Crawling documentation and the scrapyd project.
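
    For instance, a round-robin sketch in the spirit of those docs, spreading the spider list over several machines that each run their own scrapyd instance; the host list and project name are made up, and the project is assumed to be deployed to every host:

    # distribute_spiders.py: round-robin the spiders across several scrapyd hosts.
    import subprocess

    import requests

    SCRAPYD_HOSTS = [  # assumed: one scrapyd instance per crawl machine
        "http://crawler1:6800",
        "http://crawler2:6800",
        "http://crawler3:6800",
    ]
    PROJECT = "myproject"  # assumed project name, deployed to every host

    spiders = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    ).stdout.split()

    for i, name in enumerate(spiders):
        host = SCRAPYD_HOSTS[i % len(SCRAPYD_HOSTS)]
        requests.post(f"{host}/schedule.json", data={"project": PROJECT, "spider": name})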

    There is also a cloud-based distributed crawling service called ScrapingHub, which would take the scalability problems off your hands altogether (note that I am not advertising them, as I have no affiliation with the company).
