multiprocessing global variable memory copying

情深已故 2021-02-06 03:54

I am running a program that first loads 20 GB of data into memory. Then I run N (> 1000) independent tasks, each of which may use (read-only) part of the 20 GB of data.

1 answer
  • 2021-02-06 04:09

    On Linux, forked processes have a copy-on-write view of the parent's address space. Forking is lightweight, and the same program runs in both the parent and the child, except that the child takes a different execution path. As a small example:

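    import os
    var = "unchanged"
    # os.fork() returns the child's pid in the parent and 0 in the child
    pid = os.fork()
    if pid:
        print('parent:', os.getpid(), var)
        # wait for the child to finish before continuing
        os.waitpid(pid, 0)
    else:
        print('child:', os.getpid(), var)
        # this write only affects the child's copy of the variable
        var = "changed"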
    
    # show parent and child views
    print(os.getpid(), var)
    

    Results in

    parent: 22642 unchanged
    child: 22643 unchanged
    22643 changed
    22642 unchanged
    

    Applying this to multiprocessing: in the example below, I load the data into a global variable before the pool is created. Since Python pickles everything that is sent to the process pool, I make sure it pickles something small, like an index, and have the worker read the global data itself.

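    import multiprocessing as mp
    import os

    # loaded once at import time, before any worker is forked
    my_big_data = "well, bigger than this"

    def worker(index):
        """get char in big data"""
        # only the index is pickled; the data is read from the inherited global
        return my_big_data[index]

    if __name__ == "__main__":
        # the context manager shuts the pool down once the loop is done
        with mp.Pool(os.cpu_count()) as pool:
            for c in pool.imap_unordered(worker, range(len(my_big_data)), chunksize=1):
                print(c)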
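
The approach above depends on the workers being created with fork, and the default start method differs between platforms and Python versions. As a minimal sketch (not part of the original answer), the start method can be requested explicitly with `mp.get_context("fork")`. The helper name `run_all` and the pool size of 2 are made up for illustration.

```python
import multiprocessing as mp

# Stand-in for the large data set; loaded once in the parent before the pool exists.
my_big_data = "well, bigger than this"

def worker(index):
    """Read one item from the inherited global; only the index is pickled."""
    return my_big_data[index]

def run_all(processes=2):
    """Hypothetical helper: run every index through a fork-based pool."""
    # "fork" is only available on POSIX; get_context raises ValueError elsewhere.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes) as pool:
        # map() preserves input order, so the results line up with the indices
        return pool.map(worker, range(len(my_big_data)))

if __name__ == "__main__":
    print("".join(run_all()))
```

Requesting the context explicitly makes the script fail loudly on a platform without fork instead of silently falling back to a start method that does not share memory with the parent.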
    

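    Windows does not have a fork-and-exec model for running programs. multiprocessing has to start a new instance of the Python interpreter for each worker and re-create all relevant data in the child, so the copy-on-write sharing described above does not apply there. This is a heavy lift!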
