Scrapy: constructing non-duplicative list of absolute paths from relative paths

Asked by 我在风中等你 on 2021-01-24 06:18

Question: how do I use Scrapy to build a non-duplicative list of absolute paths from the relative paths found in img src attributes?

2 Answers
  • 2021-01-24 06:51

    I would use an Item Pipeline to deal with duplicated items.

    # file: yourproject/pipelines.py
    from scrapy.exceptions import DropItem
    
    class DuplicatesPipeline(object):
    
        def __init__(self):
            self.url_seen = set()
    
        def process_item(self, item, spider):
            if item['url'] in self.url_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.url_seen.add(item['url'])
                return item
    

    And add this pipeline to your settings.py

    # file: yourproject/settings.py
    ITEM_PIPELINES = {
        'yourproject.pipelines.DuplicatesPipeline': 300,
    }
    

    Then you just need to run your spider with scrapy crawl relpathfinder -o items.csv, and the pipeline will drop duplicate items for you, so you will not see any duplicates in your CSV output.

  • 2021-01-24 06:59

    What about:

    def url_join(self, response):
        # Extract every relative img src and yield one item per absolute URL
        relative_urls = response.xpath('//img/@src').extract()
        for link in relative_urls:
            item = MyItem()
            item['url'] = response.urljoin(link)
            yield item
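
    Note that this callback yields one item per img tag, including repeats; pairing it with the DuplicatesPipeline from the other answer removes them. As a minimal standard-library sketch of the join-and-deduplicate step itself (outside Scrapy; the function name is hypothetical, for illustration only):

    ```python
    from urllib.parse import urljoin

    def absolute_unique_urls(base_url, relative_urls):
        """Join each relative path against base_url, dropping duplicates
        while preserving first-seen order."""
        seen = set()
        result = []
        for rel in relative_urls:
            absolute = urljoin(base_url, rel)
            if absolute not in seen:
                seen.add(absolute)
                result.append(absolute)
        return result
    ```

    This mirrors what response.urljoin does per link, with the seen-set playing the role of the pipeline's dedup check.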
    