CPU-intensive parsing with Scrapy

[愿得一人] 2021-01-15 00:34

The CONCURRENT_ITEMS section at http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items defines it as:

Maximum number of concurrent item

2 answers
  • 2021-01-15 01:07

    The CONCURRENT_ITEMS setting limits how many items from the spider output are processed concurrently. "Concurrently" here means whatever Twisted (the framework underlying Scrapy) can overlap, which is usually I/O such as network requests rather than CPU-bound work.

    Scrapy does not use multithreading and will not use more than one core. If your spider is CPU-bound, the usual way to speed it up is to run multiple separate Scrapy processes, which avoids any bottleneck from the Python GIL.
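The multi-process approach described above could be sketched roughly as follows: split the start URLs into shards and launch one `scrapy crawl` process per shard. This is only an illustration under assumptions: the spider name `myspider`, the `urls_file` spider argument, and the helper names are hypothetical, and the spider itself would have to read `urls_file` and request the URLs listed in it. The `-a NAME=VALUE` option of `scrapy crawl` is what passes the argument to the spider.

```python
import subprocess
import tempfile
from pathlib import Path


def shard(urls, n_shards):
    """Split urls into n_shards round-robin lists (some may be empty)."""
    return [urls[i::n_shards] for i in range(n_shards)]


def build_commands(urls, n_shards, spider_name="myspider", workdir=None):
    """Write one URL file per non-empty shard and return the commands to run."""
    workdir = Path(workdir or tempfile.mkdtemp())
    commands = []
    for i, part in enumerate(shard(urls, n_shards)):
        if not part:
            continue  # fewer URLs than shards: skip the empty ones
        url_file = workdir / f"shard_{i}.txt"
        url_file.write_text("\n".join(part))
        # "urls_file" is a hypothetical spider argument; the spider must
        # be written to open this file and yield a Request per line.
        commands.append(
            ["scrapy", "crawl", spider_name, "-a", f"urls_file={url_file}"]
        )
    return commands


def run_all(commands):
    """Start every crawl as its own OS process and wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

Calling `run_all(build_commands(urls, n_shards))` from inside a Scrapy project directory would start the crawls side by side. Because each crawl is a separate operating-system process with its own interpreter, the CPU-heavy parsing in one shard does not hold up the others; choosing roughly one shard per CPU core is a reasonable starting point.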

  • 2021-01-15 01:30

    Requests are also handled in parallel; see http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests. Scrapy is designed to do both the requesting and the parsing in the spider itself: the callback methods make it asynchronous, and by default multiple Requests are in flight at the same time.

    The item pipeline, which does process items in parallel, isn't intended for heavy parsing; it is meant to check and validate the values you got in each item (see http://doc.scrapy.org/en/latest/topics/item-pipeline.html).

    Therefore you should run your queries in the spider itself, since that is where they are designed to live. From the docs on spiders:

    Spiders are classes which define how a certain site (or group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
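To illustrate the division of labour this answer describes, here is a minimal sketch of an item pipeline that only validates values the spider callback has already extracted and does no heavy parsing of its own. The class name `PriceValidationPipeline` and the `price` field are hypothetical. `DropItem` comes from `scrapy.exceptions`; the fallback class exists only so the sketch can be run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class PriceValidationPipeline:
    """Checks values the spider already extracted; does no heavy parsing."""

    def process_item(self, item, spider):
        # "price" is a hypothetical field filled in by the spider callback.
        price = item.get("price")
        if price is None:
            raise DropItem("missing price")
        try:
            item["price"] = float(price)
        except (TypeError, ValueError):
            raise DropItem(f"invalid price: {price!r}")
        if item["price"] < 0:
            raise DropItem(f"negative price: {item['price']}")
        return item
```

In a real project the pipeline would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PriceValidationPipeline": 300}`, where `myproject` is a placeholder module path. The expensive extraction work (selectors, regular expressions, text processing) stays in the spider callback that builds the item.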
