python -> multiprocessing module

别跟我提以往 2021-02-04 18:07

Here's what I am trying to accomplish -

  1. I have about a million files which I need to parse & append the parsed content to a single file.
  2. Since a sin
2 Answers
  • 2021-02-04 18:49

    Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a class called Pool which is perfect for my needs.

    By default it sizes itself to the number of cores my system has, i.e. only as many worker processes are spawned as there are cores (this is customizable via the processes argument to Pool). So here's the code; it might help someone later:

    import glob
    import os
    from multiprocessing import Pool
    
    DATA_DIR = '/path/to/data'            # assumed: directory holding the input files
    file_ptr = open('output.txt', 'a')    # assumed: the single shared output file
    
    def main():
        po = Pool()    # defaults to one worker process per CPU core
        for filepath in glob.glob(os.path.join(DATA_DIR, '*.csv')):
            # callback is invoked with the return value of mine_page
            po.apply_async(mine_page, (filepath,), callback=save_data)
        po.close()     # no more tasks will be submitted to the pool
        po.join()      # block until all workers have finished
        file_ptr.close()
    
    def mine_page(filepath):
        # do whatever it is that you want to do in a separate process,
        # e.g. parse the file, and return the result
        data = open(filepath).read()
        return data
    
    def save_data(data):
        # data is the object returned by mine_page; store it in a file, MySQL or ...
        file_ptr.write(data)
        return
    
    if __name__ == '__main__':
        main()
    

    Still going through this huge module. Not sure whether save_data() is executed by the parent process or by the spawned child processes. If it's the children doing the saving, it might lead to concurrency issues in some situations. If anyone has more experience using this module, any more knowledge here would be appreciated...
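
    For what it's worth, here is a minimal sketch (with made-up work/on_result names) to check this empirically. Printing os.getpid() from the callback shows that it runs in the parent process, in a result-handler thread, so the callback's writes are serialized and the workers never touch the output themselves:

        import os
        from multiprocessing import Pool
        
        def work(n):
            return n, os.getpid()    # pid of the worker that ran the task
        
        def on_result(result):
            n, worker_pid = result
            # this prints the parent's pid for every task, not the worker's:
            # apply_async callbacks run in the parent process
            print('task %d ran in %d, callback ran in %d' % (n, worker_pid, os.getpid()))
        
        if __name__ == '__main__':
            po = Pool(2)
            for i in range(4):
                po.apply_async(work, (i,), callback=on_result)
            po.close()
            po.join()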

  • 2021-02-04 18:57

    The docs for multiprocessing describe several ways of sharing state between processes:

    http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
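
    As a minimal sketch of the first approach from that page (a shared-memory Value, with an explicit Lock to serialize updates; the names here are made up):

        from multiprocessing import Process, Value, Lock
        
        def bump(counter, lock):
            # Value is a shared-memory slot; the lock serializes the update
            with lock:
                counter.value += 1
        
        if __name__ == '__main__':
            counter = Value('i', 0)    # shared integer, initially 0
            lock = Lock()
            procs = [Process(target=bump, args=(counter, lock)) for _ in range(4)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            print(counter.value)       # prints 4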

    I'm fairly sure each worker process gets its own copy of the parent's state (via fork on Unix, or a fresh interpreter on Windows), and the target (function) and args are loaded into it. In that case, the global namespace from your script would be bound to your worker function, so data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
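
    A quick, hedged sketch of that experiment (hypothetical names throughout): on Unix with the fork start method the child simply inherits the already-open file object, so passing it as an arg works; under spawn (Windows, and the default on recent macOS) the file object cannot be pickled and p.start() raises an error. Even where it works, several children sharing one descriptor can interleave their writes, which is why the Queue approach below is usually cleaner:

        import multiprocessing
        
        def worker(out_file, text):
            # reached only under the fork start method, where the child
            # inherits the open file object from the parent
            out_file.write(text + '\n')
            out_file.flush()
        
        if __name__ == '__main__':
            data_file = open('output.txt', 'a')
            p = multiprocessing.Process(target=worker, args=(data_file, 'hello'))
            p.start()    # under spawn this raises: file objects cannot be pickled
            p.join()
            data_file.close()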

    An alternative is to pass a Queue that will hold the results from the workers. The workers put their results on it, and the main process gets them and writes them to the file.
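
    A minimal sketch of that pattern (the file names and the parsing body are made up): only the parent ever touches the output file, so no file handle is shared at all. Note the results are drained before join(), so the children are never blocked on a full queue:

        from multiprocessing import Process, Queue
        
        def worker(filepath, results):
            # parse the file in the child and ship the result to the parent
            data = open(filepath).read()    # stand-in for real parsing
            results.put(data)
        
        if __name__ == '__main__':
            files = ['a.csv', 'b.csv', 'c.csv']    # hypothetical inputs
            results = Queue()
            procs = [Process(target=worker, args=(f, results)) for f in files]
            for p in procs:
                p.start()
            with open('output.txt', 'w') as out:
                for _ in files:
                    out.write(results.get())    # only the parent writes
            for p in procs:
                p.join()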
