How to use boto3 client with Python multiprocessing?

前端未结

关注

 2  1822

Code looks something like this:

import multiprocessing as mp
from functools import partial

import boto3
import numpy as np


s3 = boto3.client(\'s3\')

def


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  说谎        
                
              
                            
                2021-01-12 05:59
              
            
            
                                                                       
Objects passed to mp.starmap() must be pickle-able, and S3 clients are not pickle-able. Bringing the actions of the S3 client outside of the function that calls mp.starmap() can solve the issue:

import multiprocessing as mp
from functools import partial

import boto3
import numpy as np


s3 = boto3.client('s3')
archive = np.load(s3.get_object('some_key')) # Simplified -- details not relevant # Move the s3 call here, outside of the do() function

def _something(**kwargs):
    # Some mixed integer programming stuff related to the variable archive
    return np.array(some_variable_related_to_archive)


def do(archive): # pass the previously loaded archive, and not the s3 object into the function
    pool = mp.pool()
    sub_process = partial(_something, slack=0.1)
    parts = np.array_split(archive, some_int)
    target_parts = np.array(things)

    out = pool.starmap(sub_process, [x for x in zip(parts, target_parts)] # Error occurs at this line

    pool.close()
    pool.join()

do(archive) # pass the previously loaded archive, and not the s3 object into the function

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  长情又很酷        
                
              
                            
                2021-01-12 06:01
              
            
            
                                                                       
Well, I solved it in a pretty straightforward way. That is, using a more reduced a less complex object rather than . I used the class Bucket.

However, you should keeping into consideration the following post: Can't pickle  when using multiprocessing Pool.map(). I put every object related with boto3 outside any class of function. Some other posts suggest to put s3 objects and functions inside the function you're trying to parallize in order to avoid overhead, I haven't tried yet, though. Indeed, I'll share to you a code in which is possible to save information into a msgpack filetype.

My code example is as follows (outside any class or function). Hope it helps.

import pandas as pd
import boto3
from pathos.pools import ProcessPool

s3 = boto3.resource('s3')
s3_bucket_name = 'bucket-name'
s3_bucket = s3.Bucket(s3_bucket_name)

def msgpack_dump_s3 (df, filename):
    try:
        s3_bucket.put_object(Body=df.to_msgpack(), Key=filename)
        print(module, filename + " successfully saved into s3 bucket '" + s3_bucket.name + "'")
    except Exception as e:
        # logging all the others as warning
        print(module, "Failed deleting bucket. Continuing. {}".format(e))

def msgpack_load_s3 (filename):
    try:
        return s3_bucket.Object(filename).get()['Body'].read()
    except ClientError as ex:
        if ex.response['Error']['Code'] == 'NoSuchKey':
            print(module, 'No object found - returning None')
            return None
        else:
            print(module, "Failed deleting bucket. Continuing. {}".format(ex))
            raise ex
    except Exception as e:
        # logging all the others as warning
        print(module, "Failed deleting bucket. Continuing. {}".format(e))
    return

def upper_function():

    def function_to_parallelize(filename):
        file = msgpack_load_s3(filename)
        if file is not None:
            df = pd.read_msgpack(file)
        #do somenthing

        print('\t\t\tSaving updated info...')
        msgpack_dump_s3(df, filename)


        pool = ProcessPool(nodes=ncpus)
        # do an asynchronous map, then get the results
        results = pool.imap(function_to_parallelize, files)
        print("...")
        print(list(results))
        """
        while not results.ready():
            time.sleep(5)
            print(".", end=' ')

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复