How to get the pipeline object in a Scrapy spider

隐瞒了意图╮ 2021-01-13 05:02

I use MongoDB to store the crawled data.

Now I want to query the latest date in the stored data so that I can continue crawling and not ne

2 answers
  •  有刺的猬
    2021-01-13 06:04

    A Scrapy Pipeline has an open_spider method that gets executed after the spider is initialized. You can pass a reference to the database connection, the get_date() method, or the Pipeline itself, to your spider. An example of the latter with your code is:

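    # This is my pipeline
    import pymongo

    class MongoDBPipeline(object):
        def __init__(self, mongodb_db=None, mongodb_collection=None):
            # pymongo.Connection was removed in PyMongo 3.0; MongoClient replaces it.
            # `settings` is the settings object from the question's code (its import is not shown).
            self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
            ...  # rest elided in the original

        def process_item(self, item, spider):
            ...  # elided in the original

        def get_date(self):
            ...  # elided in the original

        def open_spider(self, spider):
            # Called by Scrapy when the spider is opened: give the spider a reference to this pipeline.
            spider.myPipeline = self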
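
    The body of get_date() is elided above. A minimal sketch of one way it might look with PyMongo, assuming each stored document carries a sortable field, here hypothetically named "date". The helper name get_latest_date and the field name are illustrative and not part of the original code:

```python
def get_latest_date(collection, field="date"):
    """Return the largest value of `field` in `collection`, or None if it is empty.

    `collection` is expected to be a pymongo Collection; `field` is a
    hypothetical field name, adjust it to match the stored items.
    """
    # Sort descending on the field (-1 is the value of pymongo.DESCENDING)
    # and take the first document; find_one returns None when nothing matches.
    doc = collection.find_one(sort=[(field, -1)])
    return doc[field] if doc is not None else None
```

    Inside the pipeline, get_date() could then return get_latest_date(...) on whatever collection handle __init__ sets up; that handle is not shown in the original, so its name is left open here.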
    

    Then, in the spider:

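    import scrapy

    class TestSpider(scrapy.Spider):
        name = "test"

        def __init__(self, *args, **kwargs):
            # Forward arguments so scrapy.Spider can finish its own setup.
            super().__init__(*args, **kwargs)
            # Placeholder; the pipeline's open_spider() overwrites it.
            self.myPipeline = None

        def parse(self, response):
            self.myPipeline.get_date()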
    

    I don't think the __init__() method is necessary here; I included it to show that open_spider overwrites the myPipeline attribute after the spider has been initialized.
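
    The order of events can be checked without running Scrapy. The sketch below uses plain stand-in classes (FakeSpider and FakePipeline, both hypothetical and not Scrapy APIs) and calls open_spider by hand, which Scrapy would normally do itself when the spider is opened. The date returned is a made-up value:

```python
class FakePipeline:
    """Stand-in for MongoDBPipeline; no database is involved."""

    def get_date(self):
        # Hypothetical value; a real pipeline would query MongoDB here.
        return "2021-01-01"

    def open_spider(self, spider):
        # Same hand-off as in the answer: give the spider a reference to self.
        spider.myPipeline = self


class FakeSpider:
    """Stand-in for the spider; only the attribute matters here."""

    def __init__(self):
        self.myPipeline = None  # placeholder set at construction time


spider = FakeSpider()
pipeline = FakePipeline()
before = spider.myPipeline    # still the placeholder (None)
pipeline.open_spider(spider)  # called by hand here; Scrapy does this itself
after = spider.myPipeline     # now the pipeline instance
print(before is None, after is pipeline, after.get_date())
```

    Any code that runs before open_spider, such as the spider's own __init__, still sees None, so calls to get_date() belong in callbacks like parse().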
