Scrapy: how to use items in spider and how to send items to pipelines?

隐瞒了意图╮ 2021-01-30 23:59

I am new to Scrapy and my task is simple:

For a given e-commerce website:

  • crawl all website pages

  • look for product pages

1 Answer
  • 2021-01-31 00:46
    • How to use items in my spider?

    Well, the main purpose of items is to store the data you crawled. scrapy.Item objects are basically dictionaries. To declare your items, create a class that subclasses scrapy.Item and add scrapy.Field attributes to it:

    import scrapy
    
    class Product(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
    

    You can now use it in your spider by importing your Product class.
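
    Since items behave like dictionaries, you can set fields one by one or pass values at construction time. A quick sketch (the values here are just placeholders):

    product = Product(url='http://www.example.com/p/1')
    product['title'] = 'Some product'
    print(product['url'])   # dict-style access
    print(dict(product))    # convert to a plain dict

    Note that assigning a key that was not declared as a scrapy.Field raises a KeyError, which helps catch typos early.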

    For more advanced usage, I'll let you check the Items documentation.

    • How to send items to the pipeline?

    First, you need to tell Scrapy to use your custom pipeline.

    In the settings.py file:

    ITEM_PIPELINES = {
        'myproject.pipelines.CustomPipeline': 300,
    }
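
    The number assigned to each pipeline (conventionally between 0 and 1000) determines the order in which pipelines run: items pass through lower-numbered pipelines first.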
    

    You can now write your pipeline and process your items in it.

    In the pipelines.py file:

    from scrapy.exceptions import DropItem
    
    class CustomPipeline(object):
        def __init__(self):
            # Create your database connection here if you need one
            pass
    
        def process_item(self, item, spider):
            # Drop incomplete items so they never reach your storage
            if not item.get('title'):
                raise DropItem("Missing title in %s" % item.get('url'))
            # Here you can index your item
            return item
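
    For instance, a minimal pipeline that writes every item to a JSON Lines file might look like this (a sketch; the items.jl filename is just an example). Scrapy calls open_spider and close_spider once per spider, which is the natural place to acquire and release resources such as files or database connections:

    import json
    
    class JsonWriterPipeline(object):
        def open_spider(self, spider):
            self.file = open('items.jl', 'w')
    
        def close_spider(self, spider):
            self.file.close()
    
        def process_item(self, item, spider):
            # One JSON object per line; dict(item) converts the scrapy.Item
            self.file.write(json.dumps(dict(item)) + '\n')
            return item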
    

    Finally, in your spider, you need to yield your item once it is filled.

    spider.py example:

    import scrapy
    from myspider.items import Product
    
    class MySpider(scrapy.Spider):
        name = "test"
        start_urls = ['http://www.example.com']
    
        def parse(self, response):
            doc = Product()
            doc['url'] = response.url
            # .get() extracts the first match as a string instead of a SelectorList
            doc['title'] = response.xpath('//div/p/text()').get()
            yield doc  # Will go through your pipelines
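
    Since your task is to crawl all pages of the site and pick out the product pages, here is a rough sketch of how everything fits together. The selectors div.product and h1::text are placeholders you would adapt to the real site's markup:

    import scrapy
    from myspider.items import Product
    
    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ['http://www.example.com']
    
        def parse(self, response):
            # Placeholder test for a product page; adapt to the site's markup
            if response.css('div.product'):
                doc = Product()
                doc['url'] = response.url
                doc['title'] = response.css('h1::text').get()
                yield doc
            # Follow every link so the whole site gets crawled; Scrapy's
            # built-in dupefilter skips URLs that were already visited
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)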
    

    Hope this helps. Here is the doc for pipelines: Item Pipeline
