Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?

清酒与你 2020-12-24 09:47

I'm attempting to rename the images downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the file name.

2 Answers
  • 2020-12-24 10:33

    Since the URL hash already gives every image a unique identifier, you could simply write the item's value and the URL hash to a separate file during the crawl.

    Once the crawl is done, loop over that file and do the renaming, using a Counter dictionary to append a number to each name based on how many items share the same value.
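The post-crawl renaming described above could look roughly like the sketch below. It assumes the (URL hash, item value) pairs were logged in crawl order and that the images sit in one directory as `<hash>.jpg`; the helper names `plan_renames` and `apply_renames`, the hashes, and the values are all made up for illustration.

```python
import os
import tempfile
from collections import Counter


def plan_renames(pairs):
    """Map each hashed file name to a conflict-free name built from the item value.

    `pairs` is an iterable of (url_hash, item_value) tuples, in the order
    they were logged during the crawl.
    """
    seen = Counter()
    plan = {}
    for url_hash, value in pairs:
        seen[value] += 1
        # The first occurrence keeps the plain value; later ones get -2, -3, ...
        suffix = '' if seen[value] == 1 else '-%d' % seen[value]
        plan[url_hash + '.jpg'] = '%s%s.jpg' % (value, suffix)
    return plan


def apply_renames(image_dir, plan):
    """Rename files in image_dir according to plan, skipping missing ones."""
    for old_name, new_name in plan.items():
        old_path = os.path.join(image_dir, old_name)
        if os.path.exists(old_path):
            os.rename(old_path, os.path.join(image_dir, new_name))


# Small demonstration with made-up hashes and item values.
demo_dir = tempfile.mkdtemp()
demo_pairs = [('aaa111', 'mixer-a'), ('bbb222', 'mixer-a'), ('ccc333', 'mixer-b')]
for url_hash, _ in demo_pairs:
    open(os.path.join(demo_dir, url_hash + '.jpg'), 'w').close()

demo_plan = plan_renames(demo_pairs)
apply_renames(demo_dir, demo_plan)
renamed = sorted(os.listdir(demo_dir))
```

One caveat with this scheme: a suffixed name such as `mixer-a-2` can still clash if another item's value is literally `mixer-a-2`, so the suffix format may need adjusting for real data.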

  • 2020-12-24 10:49

    The pipelines.py:

    # Scrapy 0.24 import path; Scrapy 1.0+ moved this to scrapy.pipelines.images
    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.http import Request
    from scrapy import log
    
    class MyImagesPipeline(ImagesPipeline):
    
        # Name the full-size download after the first value of the item's 'model' field.
        def file_path(self, request, response=None, info=None):
            image_guid = request.meta['model'][0]
            log.msg(image_guid, level=log.DEBUG)
            return 'full/%s' % (image_guid)
    
        # Name each thumbnail after its size id plus the last segment of the image URL.
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + request.url.split('/')[-1]
            log.msg(image_guid, level=log.DEBUG)
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
    
        # Request only the first image URL and pass the item along as meta,
        # so file_path can read item['model'].
        def get_media_requests(self, item, info):
            yield Request(item['image_urls'][0], meta=item)
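The pipeline above names each file after the model value alone, so two items with the same model would overwrite each other. One possible way around that, sketched here as an assumption rather than part of the original answer, is to append a short slice of the URL hash to the name. The helper `unique_image_path` and the example URLs are hypothetical.

```python
import hashlib


def unique_image_path(model, url):
    """Return 'full/<model>-<short hash>.jpg' for one image URL."""
    # Replace path separators so the model value cannot create subdirectories.
    safe_model = model.replace('/', '_').replace('\\', '_')
    # Different URLs give different hashes; eight hex characters keep names short.
    short_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]
    return 'full/%s-%s.jpg' % (safe_model, short_hash)


# Two made-up URLs for items that share the same model value.
path_a = unique_image_path('mixer-a', 'http://example.com/images/front.jpg')
path_b = unique_image_path('mixer-a', 'http://example.com/images/back.jpg')
```

Inside `file_path`, this would presumably be called as `unique_image_path(request.meta['model'][0], request.url)`. Truncating the hash makes a collision very unlikely but not impossible; use the full digest if that matters.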
    

    Your settings.py is configured incorrectly. Register the custom pipeline like this:

    ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}
    

    For thumbnails to work, add this to settings.py:

    IMAGES_THUMBS = {
        'small': (50, 50),
        'big': (100, 100),
    }
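To see where the thumbnails would end up with these settings, the sketch below mirrors the `thumb_path` logic from the pipeline in a standalone function. The function name `thumb_path_for` and the image URL are made up for illustration; the resulting paths are relative to the directory set in `IMAGES_STORE`.

```python
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}


def thumb_path_for(url, thumb_id):
    """Mirror the thumb_path logic from the pipeline for one URL."""
    image_guid = thumb_id + url.split('/')[-1]
    return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)


# Hypothetical image URL, used only for illustration.
url = 'http://example.com/images/front.png'
paths = sorted(thumb_path_for(url, thumb_id) for thumb_id in IMAGES_THUMBS)
# paths == ['thumbs/big/bigfront.png.jpg', 'thumbs/small/smallfront.png.jpg']
```

Note that the original extension stays inside the name, so a `.png` source ends up as `.png.jpg`. Stripping the extension before building the name would avoid that.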
    