I am trying to download 12,000 files from an S3 bucket using a Jupyter notebook, which is estimating 21 hours to complete the download. This is because each file is downloaded one at a time.
See the code below. It will only work with Python 3.6+, because of the f-strings (PEP 498); use a different method of string formatting for older versions of Python.
Provide the relative_path, bucket_name, and s3_object_keys. In addition, max_workers is optional; if not provided, the number defaults to five times the number of machine processors.
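As a side note, the "five times the processors" default applied to the ThreadPoolExecutor in Python 3.5-3.7; Python 3.8+ caps it instead. A small sketch of both defaults (the function names here are just for illustration, not part of any library):

```python
import os

def default_workers_legacy():
    # ThreadPoolExecutor default when max_workers is None, Python 3.5-3.7
    return 5 * (os.cpu_count() or 1)

def default_workers_modern():
    # Python 3.8+ default: capped at 32 to avoid spawning huge thread
    # counts on many-core machines
    return min(32, (os.cpu_count() or 1) + 4)
```

Since S3 downloads are I/O-bound rather than CPU-bound, a worker count well above the CPU count is usually fine either way.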
Most of the code in this answer came from an answer to How to create an async generator in Python?, which in turn is based on this example documented in the library.
import os
from concurrent import futures

import boto3

relative_path = './images'
bucket_name = 'bucket_name'
s3_object_keys = []  # List of S3 object keys
max_workers = 5

abs_path = os.path.abspath(relative_path)
s3 = boto3.client('s3')

def fetch(key):
    file = f'{abs_path}/{key}'
    # Create the parent directory, not the file path itself;
    # keys may contain '/' separators.
    os.makedirs(os.path.dirname(file), exist_ok=True)
    with open(file, 'wb') as data:
        s3.download_fileobj(bucket_name, key, data)
    return file

def fetch_all(keys):
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_key = {executor.submit(fetch, key): key for key in keys}

        print("All URLs submitted.")

        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception

for key, result in fetch_all(s3_object_keys):
    print(f'key: {key}  result: {result}')