CPU-intensive parsing with Scrapy

[愿得一人] 2021-01-15 00:34

The CONCURRENT_ITEMS section at http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items defines it as:

Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).

2 Answers
  •  孤街浪徒
    2021-01-15 01:30

    The request system also works in parallel; see http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests. Scrapy is designed so that requesting and parsing both happen in the spider itself: the callback methods make it asynchronous, and by default multiple requests do run in parallel.
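Both knobs are ordinary settings. A sketch of what they look like in settings.py (the values shown are Scrapy's documented defaults, not tuning advice):

```python
# settings.py: illustrative values (Scrapy's documented defaults)
CONCURRENT_REQUESTS = 16             # parallel requests across the whole crawl
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # parallel requests per domain
CONCURRENT_ITEMS = 100               # parallel items (per response) in the item pipeline
```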

    The item pipeline, which also processes items in parallel, isn't intended for heavy parsing: it is meant to check and validate the values you got in each item. (http://doc.scrapy.org/en/latest/topics/item-pipeline.html)
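A pipeline is just a class with a process_item(self, item, spider) method. A minimal validation sketch of the kind of light work pipelines are meant for (the "price" field is hypothetical; a real pipeline would raise scrapy.exceptions.DropItem, but a plain ValueError stands in here to keep the sketch dependency-free):

```python
class ValidatePricePipeline:
    """Lightweight validation, the kind of work pipelines are meant for.
    'price' is a hypothetical item field; in a real project you would
    raise scrapy.exceptions.DropItem instead of ValueError."""

    def process_item(self, item, spider):
        price = item.get("price")
        if price is None or price < 0:
            # Reject items that fail validation.
            raise ValueError(f"invalid price in {item!r}")
        item["price"] = round(price, 2)  # cheap normalisation, not heavy parsing
        return item
```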

    Therefore you should do your heavy parsing in the spider itself, where it is designed to live. From the docs on spiders:

    Spiders are classes which define how a certain site (or group of sites) will be scraped, including how to perform the crawl (ie. follow links) and how to extract structured data from their pages (ie. scraping items).
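Putting the CPU-heavy work in the spider callback might look like the sketch below. The parsing step is kept in a plain function (stdlib re stands in for real parsing logic), and the spider wiring around it is shown in comments, since the class name, spider name, and URL are all hypothetical and assume Scrapy is installed:

```python
import re

def extract_titles(html):
    # CPU-heavy parsing would live here; a simple regex stands in.
    return re.findall(r"<h1>(.*?)</h1>", html)

# Hypothetical spider wiring (assumes Scrapy is installed):
# import scrapy
#
# class ExampleSpider(scrapy.Spider):
#     name = "example"                      # hypothetical
#     start_urls = ["http://example.com"]   # hypothetical
#
#     def parse(self, response):
#         # Heavy parsing happens in the callback, as the docs suggest;
#         # other requests keep downloading while this one is parsed.
#         for title in extract_titles(response.text):
#             yield {"title": title}
#         for href in response.css("a::attr(href)").getall():
#             yield response.follow(href, callback=self.parse)
```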
