Scrapy image download: how to use a custom filename

眼角桃花 2020-11-28 07:27

For my Scrapy project I'm currently using the ImagesPipeline. The downloaded images are stored with a SHA1 hash of their URLs as the file names.

How can I s

6 answers
  • 2020-11-28 07:39

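    In Scrapy 0.12 I solved it like this:

    # Import paths as used by old Scrapy releases (scrapy.contrib).
    from scrapy.contrib.pipeline.images import ImagesPipeline
    from scrapy.http import Request
    
    class MyImagesPipeline(ImagesPipeline):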
    
        #Name download version
        def image_key(self, url):
            image_guid = url.split('/')[-1]
            return 'full/%s.jpg' % (image_guid)
    
        #Name thumbnail version
        def thumb_key(self, url, thumb_id):
            image_guid = thumb_id + url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
    
        def get_media_requests(self, item, info):
            yield Request(item['images'])
    
  • 2020-11-28 07:43

    This is just an update of the previous answer for Scrapy 0.24 (EDITED), where image_key() is deprecated in favour of file_path():

    class MyImagesPipeline(ImagesPipeline):
    
        #Name download version
        def file_path(self, request, response=None, info=None):
            #item=request.meta['item'] # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)
    
        #Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + response.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
    
        def get_media_requests(self, item, info):
            #yield Request(item['images']) # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)
    
  • 2020-11-28 07:50

    I did a nasty quick hack for that. In my case, I stored the title of each image in my feed, and I had only one image URL per item, so I wrote the following script. It renames the image files in the images/full/ directory to the corresponding title from the item feed, which I had exported as JSON.

    import os
    import json
    
    # Directory of downloaded images and the JSON feed exported by the crawl.
    img_dir = os.path.join(os.getcwd(), 'images', 'full')
    feed_path = os.path.join(os.getcwd(), 'data.json')
    
    with open(feed_path, 'r') as item_json:
        items = json.load(item_json)
    
    for item in items:
        if len(item['images']) > 0:
            # The stored path looks like 'full/<sha1>.jpg'; keep only the file name.
            cur_file = item['images'][0]['path'].split('/')[-1]
            cur_format = cur_file.split('.')[-1]
            # Rename '<sha1>.<ext>' to '<title>.<ext>'.
            new_title = item['title']+'.%s'%cur_format
            file_path = os.path.join(img_dir, cur_file)
            os.rename(file_path, os.path.join(img_dir, new_title))
    

    It's nasty and not recommended, but it works as a naive alternative that runs after the crawl instead of inside the pipeline.
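
The rename script above breaks if a title contains characters that are not valid in file names, or if two items share a title. A minimal sketch of a more defensive variant follows, assuming the same feed layout (a `title` field and an `images` list whose entries carry a `path`). The names `safe_stem` and `rename_images` are hypothetical and not part of Scrapy.

```python
import re
from pathlib import Path


def safe_stem(title):
    """Replace whitespace and characters that are unsafe in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return cleaned or "untitled"


def rename_images(items, img_dir):
    """Rename each item's first image to its title.

    Returns a list of (old_name, new_name) pairs for the files renamed.
    Items whose source file is missing, or whose target name already
    exists, are skipped rather than overwritten.
    """
    img_dir = Path(img_dir)
    renamed = []
    for item in items:
        images = item.get("images") or []
        if not images:
            continue
        old = img_dir / Path(images[0]["path"]).name
        new = img_dir / (safe_stem(item["title"]) + old.suffix)
        if not old.exists() or new.exists():
            continue
        old.rename(new)
        renamed.append((old.name, new.name))
    return renamed
```

It could be called with the list loaded from the JSON feed and the path of the images/full directory; the returned pairs make it easy to log what was changed.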

  • 2020-11-28 07:54

    I rewrote the code from the previous answer, replacing "response." with "request." in thumb_path(). Otherwise it won't work, because response is set to None there.

    class MyImagesPipeline(ImagesPipeline):
    
        #Name download version
        def file_path(self, request, response=None, info=None):
            #item=request.meta['item'] # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)
    
        #Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):
            image_guid = thumb_id + request.url.split('/')[-1]
            return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)
    
        def get_media_requests(self, item, info):
            #yield Request(item['images']) # Adding meta. Dunno how to put it in one line :-)
            for image in item['images']:
                yield Request(image)
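
One caveat with `request.url.split('/')[-1]` in the code above: it keeps any query string or fragment, and it returns an empty string for URLs that end in a slash. A minimal sketch of a more careful helper, using only the standard library, might look like this. `filename_from_url` and `image_path` are hypothetical names, and the SHA1 fallback is an assumption that simply mirrors the default naming described in the question.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    if name in ("", ".", ".."):
        # No usable segment (e.g. "http://host/"): fall back to a SHA1
        # hash of the URL so the result is never empty.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return name


def image_path(url):
    """Build a relative storage path in the same 'full/<name>' layout."""
    return "full/%s" % filename_from_url(url)
```

Inside file_path() the call would then be `return image_path(request.url)`.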
    
  • 2020-11-28 07:59

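    I found a way in 2017, with Scrapy 1.1.3:

    from scrapy import Request
    from scrapy.pipelines.images import ImagesPipeline
    
    class MyImagesPipeline(ImagesPipeline):
    
        def file_path(self, request, response=None, info=None):
            # Read back the name attached in get_media_requests().
            return request.meta.get('filename','')
    
        def get_media_requests(self, item, info):
            img_url = item['img_url']
            meta = {'filename': item['name']}
            yield Request(url=img_url, meta=meta)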
    

    As in the code above, you can attach the name you want to the Request meta in get_media_requests() and read it back in file_path() with request.meta.get('yourname', '').
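
An item name taken straight from a page may contain slashes or spaces, and the empty-string default leaves the image without a usable path when no name was set. A minimal sketch of a helper that turns a name into a relative path is shown below. `named_image_path` is a hypothetical name, and the `full/` prefix and the `.jpg` default are assumptions: the prefix follows the other answers here, and the extension follows the Scrapy documentation's statement that ImagesPipeline converts downloaded images to JPG.

```python
import re


def named_image_path(name, ext=".jpg"):
    """Turn an arbitrary item name into a relative image path."""
    # Collapse every run of characters other than word characters,
    # dots and hyphens into a single underscore.
    slug = re.sub(r"[^\w.-]+", "_", name).strip("._")
    # Never return an empty file name.
    return "full/%s%s" % (slug or "unnamed", ext)
```

In file_path() this could be used as `return named_image_path(request.meta.get('filename', ''))`.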

  • 2020-11-28 08:00

    This is how I solved the problem in Scrapy 0.10. Check the persist_image method of FSImagesStoreChangeableDirectory: the file name of the downloaded image is the key argument.

    class FSImagesStoreChangeableDirectory(FSImagesStore):
    
        def persist_image(self, key, image, buf, info,append_path):
    
            absolute_path = self._get_filesystem_path(append_path+'/'+key)
            self._mkdir(os.path.dirname(absolute_path), info)
            image.save(absolute_path)
    
    class ProjectPipeline(ImagesPipeline):
    
        def __init__(self):
            super(ImagesPipeline, self).__init__()
            store_uri = settings.IMAGES_STORE
            if not store_uri:
                raise NotConfigured
            self.store = FSImagesStoreChangeableDirectory(store_uri)
    