How to get the pipeline object in Scrapy spider

隐瞒了意图╮ 2021-01-13 05:02

I use MongoDB to store the crawled data.

Now I want to query the last date of the data so that I can continue crawling and not ne

2 answers

暖寄归人 (original poster)

2021-01-13 05:50

According to the Scrapy Architecture Overview:

The Item Pipeline is responsible for processing the items once they have been extracted (or scraped) by the spiders.

Basically, that means data flows one way: the Scrapy spiders run first, then the extracted items are passed to the pipelines. There is no way to go backwards from a pipeline to the spider.

One possible solution would be to check, in the pipeline itself, whether the item you've scraped is already in the database.

Another workaround would be to keep the list of URLs you've crawled in the database and, in the spider, check whether you've already got the data from a URL.
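The pipeline-side check could look roughly like the sketch below. It assumes each item has a `url` field that identifies it uniquely, and the connection string, database name, and collection name are placeholders, not anything from the question. The fallback `DropItem` class only exists so the sketch runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class MongoDedupPipeline:
    """Drop items whose URL is already stored; insert the rest."""

    def __init__(self, collection=None):
        # A collection can be injected (e.g. for testing); otherwise
        # one is opened in open_spider().
        self.collection = collection
        self.client = None

    def open_spider(self, spider):
        if self.collection is None:
            import pymongo  # imported lazily; only needed for a real database

            # Assumed connection details, adjust to your setup.
            self.client = pymongo.MongoClient("mongodb://localhost:27017")
            self.collection = self.client["crawl_db"]["items"]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Assumption: "url" uniquely identifies a scraped item.
        if self.collection.find_one({"url": item["url"]}) is not None:
            raise DropItem("already stored: %s" % item["url"])
        self.collection.insert_one(dict(item))
        return item
```

To try it, the class would be listed under `ITEM_PIPELINES` in the project settings. A unique index on `url` in MongoDB would be a sturdier guard than the lookup alone.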
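For the spider-side check, a minimal sketch of two helpers follows. It assumes stored documents carry `url` and `date` fields and that `collection` is a pymongo collection the spider opens itself; both field names and both function names are made up for illustration.

```python
def unseen_urls(collection, candidate_urls):
    """Return the candidate URLs that have no stored document yet."""
    candidates = list(candidate_urls)
    # Fetch only the "url" field of documents matching any candidate.
    cursor = collection.find({"url": {"$in": candidates}}, {"url": 1})
    seen = {doc["url"] for doc in cursor}
    return [url for url in candidates if url not in seen]


def latest_date(collection):
    """Return the newest stored 'date' value, or None if nothing is stored."""
    # -1 sorts in descending order, so the first match is the newest.
    doc = collection.find_one({}, sort=[("date", -1)])
    return doc["date"] if doc else None
```

A spider could call `unseen_urls` before yielding its `scrapy.Request` objects, or use `latest_date` to decide where to resume. This queries the database directly from the spider, so it does not need access to the pipeline object at all.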

Since I'm not sure what you mean by "start from the beginning", I cannot suggest anything more specific.

I hope this information helps at least a little.
