Is there any faster way for downloading multiple files from s3 to local folder?


I am trying to download 12,000 files from an S3 bucket using a Jupyter notebook, and the download is estimated to take 21 hours to complete. This is because each file is downloaded one at a time. Is there a way to download multiple files in parallel?

1 Answer

    See the code below. This will only work with Python 3.6+ because of the f-strings (PEP 498); use a different method of string formatting for older versions of Python.

    Provide the relative_path, bucket_name and s3_object_keys. In addition, max_workers is optional; if it is not provided, ThreadPoolExecutor defaults to 5 times the number of processors on the machine (Python 3.8+ caps this at min(32, os.cpu_count() + 4)).

    Most of the code for this answer came from an answer to "How to create an async generator in Python?", which in turn is based on the ThreadPoolExecutor example documented in the concurrent.futures library.

    import boto3
    import os
    from concurrent import futures
    
    
    relative_path = './images'
    bucket_name = 'bucket_name'
    s3_object_keys = [] # List of S3 object keys
    max_workers = 5
    
    abs_path = os.path.abspath(relative_path)
    s3 = boto3.client('s3')
    
    def fetch(key):
        file = f'{abs_path}/{key}'
        # Create the parent directory, not the file path itself, so open() does not fail
        os.makedirs(os.path.dirname(file), exist_ok=True)
        with open(file, 'wb') as data:
            s3.download_fileobj(bucket_name, key, data)
        return file
    
    
    def fetch_all(keys):
    
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_key = {executor.submit(fetch, key): key for key in keys}
    
            print("All URLs submitted.")
    
            for future in futures.as_completed(future_to_key):
    
                key = future_to_key[future]
                exception = future.exception()
    
                if not exception:
                    yield key, future.result()
                else:
                    yield key, exception
    
    
    for key, result in fetch_all(s3_object_keys):
        print(f'key: {key}  result: {result}')
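
    If you do not already have the list of keys, the sketch below shows one way to build s3_object_keys with the list_objects_v2 paginator before calling fetch_all. The prefix value 'images/' is only an illustrative assumption; adjust or drop it for your bucket layout.

    import boto3

    def list_keys(bucket_name, prefix=''):
        """Collect every object key in the bucket under an optional prefix."""
        s3 = boto3.client('s3')
        paginator = s3.get_paginator('list_objects_v2')
        keys = []
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            # 'Contents' is absent on empty pages, so default to an empty list
            for obj in page.get('Contents', []):
                keys.append(obj['Key'])
        return keys

    # Hypothetical prefix shown for illustration
    s3_object_keys = list_keys(bucket_name, prefix='images/')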
    