Items vs item loaders in scrapy

萌比男神i 2021-01-31 11:48

I'm pretty new to Scrapy. I know that items are used to hold scraped data, but I can't understand the difference between items and item loaders. I tried to read some example

1 answer
  •  深忆病人
    2021-01-31 12:11

    I really like the official explanation in the docs:

    Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

    In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.

    The last paragraph should answer your question.
    Item Loaders are great because they give you many processing shortcuts and let you reuse the same cleaning code across fields, which keeps everything tidy and understandable.

    A comparison example. Let's say we want to scrape this item:

    from scrapy import Item, Field

    class MyItem(Item):
        full_name = Field()
        bio = Field()
        age = Field()
        weight = Field()
        height = Field()
    

    The Item-only approach would look something like this:

    def parse(self, response):
        item = MyItem()
        full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
        # e.g. returns an ugly ['John\n', '\n\t  ', '  Snow']
        item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
        bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
        item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
        # extract_first(0) falls back to 0 when nothing matches
        age = response.xpath("//div[@class='age']/text()").extract_first(0)
        item['age'] = int(age)
        weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
        item['weight'] = int(weight)
        height = response.xpath("//div[@class='height']/text()").extract_first(0)
        item['height'] = int(height)
        return item
    

    versus the Item Loaders approach:

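    # define once in items.py
    # Scrapy 2.2+ ships the processors in the itemloaders package;
    # older versions import them from scrapy.loader.processors instead
    from itemloaders.processors import Compose, MapCompose, Join, TakeFirst
    from scrapy.loader import ItemLoader

    # strip each value, drop the ones that end up empty, then join with spaces
    clean_text = Compose(MapCompose(lambda v: v.strip() or None), Join())
    # take the first non-empty value and convert it to int
    to_int = Compose(TakeFirst(), int)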
    
    class MyItemLoader(ItemLoader):
        default_item_class = MyItem
        full_name_out = clean_text
        bio_out = clean_text
        age_out = to_int
        weight_out = to_int
        height_out = to_int
    
    # reuse the loader in as many callbacks as you want
    def parse(self, response):
        loader = MyItemLoader(response=response)
        loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
        loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
        loader.add_xpath('age', "//div[@class='age']/text()")
        loader.add_xpath('weight', "//div[@class='weight']/text()")
        loader.add_xpath('height', "//div[@class='height']/text()")
        return loader.load_item()
    

    As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic: without Item Loaders you would have to repeat that cleaning code for every field. Item Loaders are awesome and you should use them!
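
To get a feel for what the `clean_text` and `to_int` chains do without running a spider, here is a rough pure-Python sketch. The helper functions (`take_first`, `map_compose`, `join`, `compose`) are simplified stand-ins written for this illustration; they are not the real `itemloaders` processors and skip details such as loader contexts. The sample input lists are made up.

```python
# Simplified stand-ins for TakeFirst, MapCompose, Join and Compose.
# Illustration only: these are NOT the itemloaders implementations.

def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None

def map_compose(*funcs):
    """Apply each function to every value in turn, dropping None results."""
    def process(values):
        for func in funcs:
            values = [r for r in (func(v) for v in values) if r is not None]
        return values
    return process

def join(separator=" "):
    """Join the list of values into one string."""
    def process(values):
        return separator.join(values)
    return process

def compose(*funcs):
    """Pass the whole value through each function, stopping early on None."""
    def process(value):
        for func in funcs:
            if value is None:
                break
            value = func(value)
        return value
    return process

# Same shape as the chains defined in items.py above
clean_text = compose(map_compose(lambda v: v.strip() or None), join())
to_int = compose(take_first, int)

raw_name = ["John\n", "\n\t  ", "  Snow"]  # hypothetical XPath result
raw_age = ["42"]                           # hypothetical XPath result

print(clean_text(raw_name))  # John Snow
print(to_int(raw_age))       # 42
```

The point is that each chain is defined once and then attached to any number of fields, so the cleaning logic is written and tested in a single place.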
