How to use boto3 client with Python multiprocessing?

滥情空心 2021-01-12 05:11

Code looks something like this:

import multiprocessing as mp
from functools import partial

import boto3
import numpy as np


s3 = boto3.client('s3')

def          


        
2 Answers
  • 2021-01-12 05:59

    Arguments passed to pool.starmap() must be picklable, and boto3 S3 clients are not. Moving the S3 call out of the function that calls pool.starmap(), so that only the already loaded data is passed in, can solve the issue:

    import multiprocessing as mp
    from functools import partial
    
    import boto3
    import numpy as np
    
    
    s3 = boto3.client('s3')
    # Simplified -- details not relevant. The S3 call now happens here, outside of do().
    archive = np.load(s3.get_object('some_key'))
    
    # starmap passes each (part, target_part) tuple as positional arguments,
    # so the worker must accept them positionally; slack is bound via partial.
    def _something(part, target_part, slack):
        # Some mixed integer programming stuff related to the variable archive
        return np.array(some_variable_related_to_archive)
    
    
    def do(archive):  # receives the previously loaded archive, not the S3 client
        pool = mp.Pool()  # the class is mp.Pool, not mp.pool
        sub_process = partial(_something, slack=0.1)
        parts = np.array_split(archive, some_int)
        target_parts = np.array(things)
    
        # Only arrays and a partial of a module-level function are pickled here
        out = pool.starmap(sub_process, zip(parts, target_parts))
    
        pool.close()
        pool.join()
        return out
    
    do(archive)  # pass the previously loaded archive, not the S3 client
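
To see why the pickling constraint matters, here is a minimal sketch using only the standard library. `FakeClient` and `is_picklable` are hypothetical names introduced for illustration. `FakeClient` is a stand-in that holds a lock, not a real boto3 client; the sketch only assumes that a real client holds similarly unpicklable state.

```python
import pickle
import threading


class FakeClient:
    """Hypothetical stand-in for an object that holds OS-level state."""

    def __init__(self):
        # Locks (like sockets and open connections) cannot be pickled.
        self._lock = threading.Lock()


def is_picklable(obj):
    """Return True if obj could be sent to a worker process as an argument."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print("plain data:", is_picklable([1, 2, 3]))
    print("client-like object:", is_picklable(FakeClient()))
```

Anything that fails this check cannot be handed to a pool worker as an argument, which is why the answer passes the loaded array instead of the client.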
    
  • 2021-01-12 06:01

    Well, I solved it in a pretty straightforward way: by using a simpler, less complex object. I used the Bucket class.

    However, you should take the following post into consideration: Can't pickle when using multiprocessing Pool.map(). I put every boto3-related object outside any class or function. Some other posts suggest putting the S3 objects and functions inside the function you're trying to parallelize in order to avoid overhead; I haven't tried that yet, though. I'll also share code that saves information to S3 in msgpack format.

    My code example is as follows (outside any class or function). Hope it helps.

    import pandas as pd
    import boto3
    from botocore.exceptions import ClientError
    from pathos.pools import ProcessPool
    
    # Note: to_msgpack() and read_msgpack() were removed in pandas 1.0,
    # so this code needs an older pandas version.
    module = __name__  # label for log messages (undefined in the original; assumed here)
    
    s3 = boto3.resource('s3')
    s3_bucket_name = 'bucket-name'
    s3_bucket = s3.Bucket(s3_bucket_name)
    
    def msgpack_dump_s3(df, filename):
        try:
            s3_bucket.put_object(Body=df.to_msgpack(), Key=filename)
            print(module, filename + " successfully saved into s3 bucket '" + s3_bucket.name + "'")
        except Exception as e:
            # log every failure as a warning and carry on
            print(module, "Failed saving object. Continuing. {}".format(e))
    
    def msgpack_load_s3(filename):
        try:
            return s3_bucket.Object(filename).get()['Body'].read()
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print(module, 'No object found - returning None')
                return None
            print(module, "Failed loading object. {}".format(ex))
            raise
        except Exception as e:
            # log all other failures as warnings
            print(module, "Failed loading object. Continuing. {}".format(e))
            return None
    
    def upper_function(files, ncpus):
        # files (a list of S3 keys) and ncpus were undefined in the original;
        # they are passed in as parameters here.
    
        def function_to_parallelize(filename):
            file = msgpack_load_s3(filename)
            if file is None:
                return  # nothing to update for this key
            df = pd.read_msgpack(file)
            # do something
    
            print('\t\t\tSaving updated info...')
            msgpack_dump_s3(df, filename)
    
        pool = ProcessPool(nodes=ncpus)
        # imap returns an iterator; list() below waits for all results
        results = pool.imap(function_to_parallelize, files)
        print("...")
        print(list(results))
        # With an asynchronous map (pool.amap) one could poll instead:
        # while not results.ready():
        #     time.sleep(5)
        #     print(".", end=' ')
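
The other approach mentioned in this answer, creating the S3 object inside the worker instead of passing it in, could look roughly like the sketch below. The answer's author did not try it, so treat this as an illustration only. `make_client`, `get_client` and `worker` are hypothetical names, and `make_client` returns a plain dict as a stand-in so the sketch runs without AWS credentials; in real code it would presumably return `boto3.client('s3')`.

```python
import multiprocessing as mp
import os

_client = None  # one client per process, created on first use


def make_client():
    # Stand-in so the sketch runs without AWS access (assumption).
    # Real code would do: import boto3; return boto3.client("s3")
    return {"created_in_pid": os.getpid()}


def get_client():
    """Create the client lazily, once per process, and reuse it afterwards."""
    global _client
    if _client is None:
        _client = make_client()
    return _client


def worker(key):
    client = get_client()  # never pickled: it is built inside the worker
    # Real code would call something like client.get_object(...) here.
    return key, client["created_in_pid"]


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # Only the string keys cross the process boundary.
        print(pool.map(worker, ["a", "b", "c"]))
```

Because only the keys are sent to the workers, nothing client-related has to be pickled, and each process pays the client construction cost once rather than once per task.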
    