Can I speed up performance by applying functions to an item in a data object with multiprocessing?


Disclaimer: I have gone through loads of multiprocessing answers on SO, as well as the documentation, and either the questions were really old (Python 3.X has …

1 Answer

    The reason your example is not performing well is that you are doing two totally different things.

    In your list comprehension, you are mapping f onto each element of li.

    In the second case, you are splitting your li list into jobs chunks and then applying your function jobs times, once per chunk. So now, inside f, n * 100 receives a chunk roughly a quarter the size of your original list and "multiplies" it by 100, i.e., it uses the sequence-repetition operator and creates a new list 100 times the size of the chunk:

    >>> chunk = [1,2,3]
    >>> chunk * 10
    [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
    >>>
    

    So basically, you are comparing apples to oranges.
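    For the chunked approach to even be comparable, f would have to be applied element-wise inside each chunk. Since your exact chunking code isn't shown, here is a minimal sketch of what that might look like; the f_chunk helper, the jobs count, and the slicing are illustrative assumptions rather than your code:

    import multiprocessing as mp

    def f(n):
        return n * 100

    def f_chunk(chunk):
        # Apply f to every element of the chunk instead of repeating the list
        return [f(n) for n in chunk]

    if __name__ == '__main__':
        li = list(range(1000000))
        jobs = 4
        size = len(li) // jobs
        chunks = [li[i:i + size] for i in range(0, len(li), size)]

        with mp.Pool(jobs) as pool:
            # One task per chunk; flatten the per-chunk results back together
            result = [x for part in pool.map(f_chunk, chunks) for x in part]
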

    However, multiprocessing already comes with an out-of-the-box mapping utility. Here is a better comparison, a script called foo.py:

    import time
    import multiprocessing as mp
    
    def f(x):
        return x * 100
    
    if __name__ == '__main__':
        data = list(range(1000000))
    
        # Serial baseline: call f on every element in the main process
        start = time.time()
        [f(i) for i in data]
        stop = time.time()
        print(f"List comprehension took {stop - start} seconds")
    
        # Parallel version: Pool.map fans the elements out to 4 worker processes
        start = time.time()
        with mp.Pool(4) as pool:
            result = pool.map(f, data)
        stop = time.time()
        print(f"Pool.map took {stop - start} seconds")
    

    Now, here are some actual performance results:

    (py37) Juans-MBP:test_mp juan$ python foo.py
    List comprehension took 0.14193987846374512 seconds
    Pool.map took 0.2513458728790283 seconds
    (py37) Juans-MBP:test_mp juan$
    

    For this very trivial function, the cost of the inter-process communication (each item and its result have to be pickled and shipped between processes) will always be higher than the cost of calling the function serially, so you won't see any gains from multiprocessing. A much less trivial function, however, can see gains.

    Here's a trivial way to simulate that: I simply sleep for a microsecond before multiplying:

    import time
    import multiprocessing as mp
    
    def f(x):
        # Artificial delay standing in for real per-item work
        time.sleep(0.000001)
        return x * 100
    
    if __name__ == '__main__':
        data = list(range(1000000))
    
        start = time.time()
        [f(i) for i in data]
        stop = time.time()
        print(f"List comprehension took {stop - start} seconds")
    
        start = time.time()
        with mp.Pool(4) as pool:
            result = pool.map(f, data)
        stop = time.time()
        print(f"Pool.map took {stop - start} seconds")
    

    And now you see gains roughly commensurate with the number of processes:

    (py37) Juans-MBP:test_mp juan$ python foo.py
    List comprehension took 13.175776720046997 seconds
    Pool.map took 3.1484851837158203 seconds
    

    Note, on my machine, a single multiplication takes orders of magnitude less time than a microsecond (about 10 nanoseconds):

    >>> import timeit
    >>> timeit.timeit('100*3', number=int(1e6))*1e-6
    1.1292944999993892e-08
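
    If the real function really is that cheap, one thing that can help is Pool.map's chunksize argument, which batches many items into each message sent to a worker so the pickling/IPC cost is amortized over the batch. The value below is only an illustrative guess, not a tuned setting, and for something as cheap as x * 100 it may still not beat the plain list comprehension:

    import multiprocessing as mp

    def f(x):
        return x * 100

    if __name__ == '__main__':
        data = list(range(1000000))
        with mp.Pool(4) as pool:
            # Larger chunks mean fewer, bigger messages between processes
            result = pool.map(f, data, chunksize=10000)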
    