The reason your example is not performing well is that you are timing two totally different things.

In your list comprehension, you map f onto each element of li.

In the second case, you split your li list into jobs chunks and then apply your function f jobs times, once to each of those chunks. Inside f, n * 100 now receives a chunk about a quarter the size of your original list and "multiplies" it by 100; that is, it uses the sequence-repetition operator, so it creates a new list 100 times the size of the chunk:
>>> chunk = [1,2,3]
>>> chunk * 10
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>>
So basically, you are comparing apples to oranges.
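To make that concrete, here's a rough sketch of what the chunked version presumably does versus what was probably intended (the names li and chunks, and the way the chunks are built, are assumptions, since your chunking code isn't reproduced here):

def f(n):
    return n * 100

li = list(range(8))
chunks = [li[i::4] for i in range(4)]   # four chunks, just for illustration

# What the chunked version effectively does: f receives a whole *list*,
# so n * 100 repeats that list 100 times (sequence repetition).
repeated = [f(chunk) for chunk in chunks]

# What was presumably intended: apply f to each *element* of each chunk.
mapped = [[f(n) for n in chunk] for chunk in chunks]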
However, multiprocessing already comes with an out-of-the-box mapping utility. Here is a better comparison, in a script called foo.py:
import time
import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
Now, here are some actual performance results:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 0.14193987846374512 seconds
Pool.map took 0.2513458728790283 seconds
(py37) Juans-MBP:test_mp juan$
For this very trivial function, the cost of the inter-process communication will always be higher than the cost of computing the function serially, so you won't see any gains from multiprocessing. A much less trivial function, however, can see gains.
Here's a contrived way to make the work less trivial: simply sleep for a microsecond before multiplying:
import time
import multiprocessing as mp

def f(x):
    time.sleep(0.000001)
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    [f(i) for i in data]
    stop = time.time()
    print(f"List comprehension took {stop - start} seconds")

    start = time.time()
    with mp.Pool(4) as pool:
        result = pool.map(f, data)
    stop = time.time()
    print(f"Pool.map took {stop - start} seconds")
And now, you see gains commensurate with the number of processes:
(py37) Juans-MBP:test_mp juan$ python foo.py
List comprehension took 13.175776720046997 seconds
Pool.map took 3.1484851837158203 seconds
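(Concretely: 13.18 s over 10^6 calls is about 13 µs per call serially, and 13.18 / 3.15 ≈ 4.2, so four workers buy roughly a 4x speedup.)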
Note that, on my machine, a single multiplication takes orders of magnitude less time than a microsecond (about 10 nanoseconds):
>>> import timeit
>>> timeit.timeit('100*3', number=int(1e6))*1e-6
1.1292944999993892e-08
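One last aside, not from the timings above: when the per-item work really is this tiny, Pool.map's optional chunksize argument is worth experimenting with, since it controls how many items get batched into each message sent to a worker. Here's a minimal sketch of the first benchmark with an explicit chunksize (the value 10000 is just a guess to try; I haven't timed this variant):

import time
import multiprocessing as mp

def f(x):
    return x * 100

if __name__ == '__main__':
    data = list(range(1000000))

    start = time.time()
    with mp.Pool(4) as pool:
        # chunksize batches this many items into each task sent to a worker,
        # so fewer (but larger) messages cross the process boundary
        result = pool.map(f, data, chunksize=10000)
    stop = time.time()
    print(f"Pool.map with chunksize=10000 took {stop - start} seconds")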