Correct way to nest Item data in Scrapy

故里飘歌 2020-12-05 08:36

What is the correct way to nest Item data?

For example, I want the output of a product:

{
'price': price,
'title': title,
'meta': {
    'url'

2 answers
  • 2020-12-05 09:01

    UPDATE from comments: nested loaders appear to be the more current approach. Another comment suggests that the approach below will cause errors during serialization.

    The best way to approach this is to create separate item classes and loaders for the main item and the meta item.

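    from scrapy.item import Item, Field
    # scrapy.contrib.loader was removed in later Scrapy releases;
    # ItemLoader now lives in scrapy.loader and the processors in itemloaders.
    from scrapy.loader import ItemLoader
    from itemloaders.processors import TakeFirst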
    
    
    class MetaItem(Item):
        url = Field()
        added_on = Field()
    
    
    class MainItem(Item):
        price = Field()
        title = Field()
        meta = Field(serializer=MetaItem)
    
    
    class MainItemLoader(ItemLoader):
        default_item_class = MainItem
        default_output_processor = TakeFirst()
    
    
    class MetaItemLoader(ItemLoader):
        default_item_class = MetaItem
        default_output_processor = TakeFirst()
    

    Sample usage:

    from scrapy import Spider  # scrapy.spider was removed in later releases
    from qwerty.items import MainItemLoader, MetaItemLoader
    from scrapy.selector import Selector
    
    
    class DmozSpider(Spider):
        name = "dmoz"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com"]
    
        def parse(self, response):
            mainloader = MainItemLoader(selector=Selector(response))
            mainloader.add_value('title', 'test')
            mainloader.add_value('price', 'price')
            mainloader.add_value('meta', self.get_meta(response))
            return mainloader.load_item()
    
        def get_meta(self, response):
            metaloader = MetaItemLoader(selector=Selector(response))
            metaloader.add_value('url', response.url)
            metaloader.add_value('added_on', 'now')
            return metaloader.load_item()
    

    After that, you can easily expand your items in the future by creating more "sub-items."
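
If a nested item does cause trouble at export time, one possible workaround is to convert the whole structure to plain dicts before serializing it. The following is only a minimal sketch: `to_plain` is a hypothetical helper name, and `FakeItem` is a stand-in for a Scrapy Item (a mapping that is not a dict subclass) so that the snippet runs without Scrapy installed.

```python
import json
from collections import UserDict
from collections.abc import Mapping


class FakeItem(UserDict):
    """Stand-in for a Scrapy Item: a mapping that is not a dict subclass."""


def to_plain(value):
    """Recursively convert mappings (such as nested items) into plain dicts."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    return value


# Build a nested structure shaped like MainItem with a MetaItem inside.
meta = FakeItem(url="http://example.com", added_on="now")
main = FakeItem(price="price", title="test", meta=meta)

# json.dumps(main) would raise TypeError here; the converted copy serializes.
plain = to_plain(main)
serialized = json.dumps(plain, sort_keys=True)
```

In a real project the same function could be called on the loaded item in a pipeline, assuming the exporter in use accepts plain dicts.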

  • 2020-12-05 09:27

    I think it would be more straightforward to construct the dictionary in the spider. The snippet below shows two equivalent ways of filling it (a dict literal and a key assignment), both achieving the same result. The only possible dealbreaker is that loader processors apply to the item['meta'] field as a whole, not to the item['meta']['added_on'] and item['meta']['url'] fields individually.

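    def parse(self, response):
        item = MyItem()  # any Item subclass that declares a `meta` field
        # Way 1: create the nested dict with a literal.
        item['meta'] = {'added_on': response.css("a::text").extract()[0]}
        # Way 2: add a key to the existing nested dict.
        item['meta']['url'] = response.xpath("//a/@href").extract()[0]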
        return item
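
Since processors only see the value of `meta` as a whole, any per-key cleanup has to happen before the dict is assigned to the item. A rough sketch of that idea follows; `clean_meta` and `META_PROCESSORS` are hypothetical names chosen for illustration and are not part of Scrapy, and the cleaning rules are made up.

```python
def clean_meta(raw_meta, processors):
    """Apply a per-key callable to each entry of a nested dict.

    Keys without a registered processor are passed through unchanged.
    """
    return {
        key: processors.get(key, lambda v: v)(value)
        for key, value in raw_meta.items()
    }


# Hypothetical per-key processors (assumptions, not Scrapy API).
META_PROCESSORS = {
    "url": str.strip,
    "added_on": lambda value: value.strip().lower(),
}

raw = {"url": " http://example.com/item ", "added_on": " Now ", "source": "list page"}
meta = clean_meta(raw, META_PROCESSORS)
```

The result could then be assigned with `item['meta'] = meta` in `parse`.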
    

    Is there a specific reason you want to construct it that way instead of unpacking the meta field?
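
For comparison, "unpacking" could mean lifting the nested keys to the top level so that each one becomes an ordinary field. A minimal sketch, assuming a `meta_` prefix for the new key names; `unpack_meta` is a hypothetical helper and plain dicts stand in for items.

```python
def unpack_meta(item, nested_key="meta", prefix="meta_"):
    """Return a flat copy of item with the nested dict lifted to top-level keys."""
    flat = {key: value for key, value in item.items() if key != nested_key}
    for sub_key, sub_value in item.get(nested_key, {}).items():
        flat[prefix + sub_key] = sub_value
    return flat


nested = {
    "price": "price",
    "title": "test",
    "meta": {"url": "http://example.com", "added_on": "now"},
}
flat = unpack_meta(nested)
```

With a flat layout, each of these keys can be declared as its own Field on the item, so processors can be attached to them individually.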
