Question: how do I use Scrapy to create a non-duplicative list of absolute paths from the relative paths under the img src attribute?
I would use an Item Pipeline to deal with the duplicated items:
# file: yourproject/pipelines.py
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        # URLs already emitted during this crawl
        self.url_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.url_seen:
            # URL already seen: discard this item
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.url_seen.add(item['url'])
            return item
And add this pipeline to your settings.py:
# file: yourproject/settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.DuplicatesPipeline': 300,
}
Then you just need to run your spider with scrapy crawl relpathfinder -o items.csv
and the pipeline will drop the duplicate items for you, so you will not see any duplicates in your CSV output.
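For reference, both the pipeline above and the spider code below assume an item class with a single url field. A minimal sketch of such an item (the MyItem name comes from the spider code; the rest is an assumption about your project layout) could be:
# file: yourproject/items.py
import scrapy

class MyItem(scrapy.Item):
    # one absolute image URL per item (illustrative definition)
    url = scrapy.Field()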
What about:
def url_join(self, response):
    # yield one item per absolute image URL
    for link in response.xpath('//img/@src').extract():
        item = MyItem()
        item['url'] = response.urljoin(link)
        yield item
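Yielding a fresh item for each absolute URL pairs nicely with the DuplicatesPipeline above: every repeated URL becomes its own item and is dropped before it reaches the CSV output.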