How to download Scrapy images into a dynamic folder?

小蘑菇 2020-12-21 05:55

I am able to download images through Scrapy into the "full" folder, but I need to make the name of the destination folder dynamic, like full/session_id, every

2 answers
  • 2020-12-21 06:37

    Here is the answer, from stackoverflow.com:

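    import hashlib
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.python import to_bytes
    class StoreImgPipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None):
            # YEAR is not defined in this snippet; it is assumed to be set elsewhere
            image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            return 'realty-sc/%s/%s/%s/%s.jpg' % (YEAR, image_guid[:2], image_guid[2:4], image_guid)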
    
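To get the full/session_id layout asked about in the question, the file_path override above could build the folder name from a per-run value. A minimal sketch, assuming Scrapy 2.4 or later (which passes the item to file_path as a keyword argument) and a hypothetical session_id field that the spider sets on each item. session_image_path and SessionFolderPipeline are illustrative names, not part of Scrapy:

```python
import hashlib

try:
    from scrapy.pipelines.images import ImagesPipeline
except ImportError:  # fallback so the helper below can be tried without Scrapy
    ImagesPipeline = object


def session_image_path(url, session_id):
    """Build 'full/<session_id>/<sha1 of url>.jpg' (this layout is an assumption)."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s/%s.jpg" % (session_id, image_guid)


class SessionFolderPipeline(ImagesPipeline):
    # The keyword-only `item` argument is passed by Scrapy 2.4 and later.
    def file_path(self, request, response=None, info=None, *, item=None):
        # 'session_id' is a hypothetical item field set by the spider.
        return session_image_path(request.url, item["session_id"])
```

The returned path is relative to IMAGES_STORE, so nothing has to be moved after the download. The pipeline would still need to be enabled in ITEM_PIPELINES in place of the stock ImagesPipeline.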
  • 2020-12-21 06:46

    I have not worked with the ImagesPipeline yet, but following the documentation, I'd override item_completed(results, item, info).

    The original definition is:

    def item_completed(self, results, item, info):
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
        return item
    

    This should give you the result sets of the downloaded images including the path (seems there can be many images on one item).

    If you now change this method in a subclass to move all files before setting the path, it should work as you want. You could set the target folder on your item in something like item['session_path']. You'd have to set this field on each item before returning/yielding your items from the spider.
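The move step on its own can be tried outside Scrapy. This is only a sketch using the standard library: move_to_session is a hypothetical helper name, and it uses shutil.move rather than os.rename so that it also works when the session folder is on a different filesystem.

```python
import os
import shutil


def move_to_session(src_path, session_dir):
    """Move one downloaded file into session_dir and return its new path."""
    # Create the session folder the first time it is needed.
    os.makedirs(session_dir, exist_ok=True)
    target_path = os.path.join(session_dir, os.path.basename(src_path))
    # shutil.move raises an exception on failure instead of returning a status.
    shutil.move(src_path, target_path)
    return target_path
```

Inside a pipeline, src_path would be the downloaded file's absolute location and session_dir would come from item['session_path'].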

    The subclass with the overridden method could then look like this:

    import os
    from scrapy.pipelines.images import ImagesPipeline, ImageException
    
    class SessionImagesPipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            moved = []
            # iterate over the results of all successfully downloaded images
            for result in [x for ok, x in results if ok]:
                # result['path'] is relative to IMAGES_STORE
                # (self.store.basedir assumes the default filesystem store)
                path = os.path.join(self.store.basedir, result['path'])
                # here we create the session-path where the files should be in the end
                # you'll have to change this path creation depending on your needs
                target_path = os.path.join(item['session_path'], os.path.basename(path))
    
                # try to move the file and raise exception if not possible
                # (os.rename returns None, so a failure shows up as OSError)
                try:
                    os.makedirs(item['session_path'], exist_ok=True)
                    os.rename(path, target_path)
                except OSError as exc:
                    raise ImageException("Could not move image to target folder") from exc
    
                # here we'll record the result with the new path
                result['path'] = target_path
                moved.append(result)
    
            # write out the results if there is a result field on the item
            # (just like the original code does)
            if self.IMAGES_RESULT_FIELD in item.fields:
                item[self.IMAGES_RESULT_FIELD] = moved
            return item
    

    Even nicer would be to set the desired session path not in the item, but in the configuration during your scrapy run. For this, you would have to find out how to set config while the application is running and you'd have to override the constructor, I think.
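One way to keep the session out of the item, without touching the constructor, might be to give each run its own IMAGES_STORE. A sketch under that assumption: session_settings and the "images" base folder are hypothetical, while ITEM_PIPELINES and IMAGES_STORE are regular Scrapy settings.

```python
import os


def session_settings(session_id, base_dir="images"):
    """Return Scrapy settings that give each run its own image store."""
    return {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        # The stock pipeline saves files to <IMAGES_STORE>/full/<sha1>.jpg
        "IMAGES_STORE": os.path.join(base_dir, session_id),
    }
```

The same single setting can be overridden per run from the command line with `scrapy crawl <spider> -s IMAGES_STORE=images/session_1`. Note that the resulting layout is session_id/full/ rather than full/session_id/.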
