Question
The joblib module provides a simple helper class to write parallel for loops using multiprocessing.
This code uses a list comprehension to do the job:
import time
from math import sqrt
from joblib import Parallel, delayed
start_t = time.time()
list_comprehension = [sqrt(i ** 2) for i in range(1000000)]
print('list comprehension: {}s'.format(time.time() - start_t))
It takes about 0.51s:
list comprehension: 0.5140271186828613s
This code uses the joblib.Parallel() constructor:
start_t = time.time()
list_from_parallel = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(1000000))
print('Parallel: {}s'.format(time.time() - start_t))
It takes about 31s:
Parallel: 31.3990638256073s
Why is that? Shouldn't Parallel() be faster than a non-parallelised computation?
Here is part of the cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0x1
cpu MHz : 2200.000
cache size : 56320 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
Answer 1:
Q: Shouldn't Parallel() be faster than a non-parallelised computation?
Well, that depends; it depends a lot on the circumstances (be it joblib.Parallel() or any other way).
No benefits ever come for free (all such promises have failed to deliver, since 1917...).
Plus, it is very easy to end up paying way more (on spawning processes to start the multiprocessing) than you receive back (the speedup expected over the original workflow), so due care is a must.
The best first step:
Revisit Amdahl's law, its revisions and the criticism about process-scheduling effects (the speedup achieved from re-organising process-flows and using, at least in some part, parallel process-scheduling).
The original Amdahl formulation was not explicit about the so-called add-on "costs" one has to pay for going into parallel work-flows, costs that are not in the budget of the original, pure-[SERIAL] flow-of-work.
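To make those add-on costs concrete, here is a minimal sketch (illustrative names, not from the original answer) of an overhead-aware Amdahl speedup: any setup and transfer cost o, expressed as a fraction of the original runtime, is added to the denominator and can push the "speedup" far below 1.0:
def speedup(p, N, o):
    # p ... parallelisable fraction of the original run-time
    # N ... number of workers
    # o ... add-on overheads (spawn + transfers), as a fraction of the original run-time
    s = 1.0 - p                       # the part that stays serial
    return 1.0 / (s + p / N + o)

print(speedup(p=0.95, N=2, o=0.00))   # ~1.90  : the classical, cost-free promise
print(speedup(p=0.95, N=2, o=60.0))   # ~0.017 : overheads ~60x the useful work => a slowdown
Note that the ratio 31.39 / 0.51 ≈ 61 in the question's timings corresponds to an o of roughly 60.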
1) Process instantiation has always been expensive in Python, as it first has to replicate as many copies as needed (O/S-driven RAM allocations sized for n_jobs (here 2) copies, plus O/S-driven copying of the RAM-image of the main Python session). (Thread-based multiprocessing yields negative speedup, as the GIL-lock still re-[SERIAL]-ises the work-steps among all the spawned threads, so you get nothing, while having paid immense add-on costs for the spawning plus for each GIL-acquire/GIL-release step-dance; an awful antipattern for compute-intensive tasks. It may help mask some cases of I/O-related latency, but it is definitely not a fit for compute-intensive workloads.)
2) Add-on costs for parameter transfer: you have to move some data from the main process to the new ones. This costs add-on time that is not present in the original, pure-[SERIAL] workflow.
3) Add-on costs for result-return transfer: you have to move some data from the new processes back to the originating (main) one. This costs add-on time that is not present in the original, pure-[SERIAL] workflow.
4) Add-on costs for any inter-process data interchange (better to avoid any temptation to use this in parallel workflows; why? a) it blocks, and b) it is expensive, so you pay even more add-on costs to get any further, costs which you do not pay in the pure-[SERIAL] original workflow). A rough illustration of costs 2) and 3) follows below.
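Every parameter and every result crosses the process boundary in serialised form, so even a bare pickle round-trip, without any real inter-process transport at all, already shows a per-item cost comparable to the useful sqrt() work itself (an illustrative sketch, not from the original answer):
import pickle
import time
from math import sqrt

start_t = time.time()
for i in range(1000000):
    _ = pickle.loads(pickle.dumps(i ** 2))         # cost 2): ship the parameter out
    _ = pickle.loads(pickle.dumps(sqrt(i ** 2)))   # cost 3): ship the result back
print('pickle round-trips only: {}s'.format(time.time() - start_t))
# no spawning, no queues, no real IPC yet -- serialisation alone typically
# costs about as much as (or more than) the whole 0.51 [s] serial run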
Q: Why does joblib.Parallel() take much more time than a non-parallelised computation?
Simply because you have to pay way, way more to launch the whole orchestrated circus than you will ever receive back from such a parallel work-flow organisation: there is too small an amount of work in math.sqrt( <int> ) to ever justify the relatively immense costs of spawning 2 full copies of the original Python (main) session, plus all the orchestration dances needed to send each and every ( <int> ) from (main) to the workers and to retrieve each resulting ( <float> ) from the (joblib.Parallel()-process) back to (main).
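For completeness, a hedged sketch of what a large-enough work-package could look like (illustrative code; sqrt_chunk is a hypothetical helper, not part of the original answer): each worker receives one big slice, so the transfer costs are paid once per chunk instead of once per element:
from math import sqrt
from joblib import Parallel, delayed

def sqrt_chunk(lo, hi):
    # enough useful work per call to outweigh the per-call add-on costs
    return [sqrt(i ** 2) for i in range(lo, hi)]

n_jobs = 2
N = 1000000
step = N // n_jobs
chunks = Parallel(n_jobs=n_jobs)(
    delayed(sqrt_chunk)(lo, lo + step) for lo in range(0, N, step)
)
list_from_parallel = [x for chunk in chunks for x in chunk]
Even then, the two half-million-element result lists still have to be pickled back to (main), so for work as tiny as sqrt() the chunked version may at best roughly match the 0.51 [s] list comprehension; the point is that the add-on costs are now paid per chunk, not per element.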
Your raw benchmarking times provide a sufficient comparison of the accumulated costs of producing the same result:
[SERIAL]-<iterator> feeding a [SERIAL]-processing storing into list[]: 0.51 [s]
[SERIAL]-<iterator> feeding [PARALLEL]-processing storing into list[]: 31.39 [s]
A raw estimate says about 30.9 seconds were "wasted" doing the same (small) amount of work, just by forgetting about the add-on costs one always has to pay.
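Expressed per task (an illustrative back-of-the-envelope calculation using the numbers above, not part of the original answer):
add_on_total = 31.39 - 0.51             # [s] paid on top of the useful work
per_task = add_on_total / 1000000       # [s] of add-on cost per delayed() task
print('{:.1f} us of overhead per task'.format(per_task * 1e6))   # ~30.9 us each
That is roughly 31 [us] of pure overhead per task, against well under 1 [us] of useful sqrt() work.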
So, how to measure how much you have to pay... before one has to pay it?
Benchmark, benchmark, benchmark the actual code... (prototype).
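A minimal benchmarking template along those lines might look like this (a sketch, under the assumption that timing an almost-empty Parallel() call approximates the pure setup cost on your own platform):
import time
from joblib import Parallel, delayed

def noop(x):
    return x

start_t = time.time()
Parallel(n_jobs=2)(delayed(noop)(i) for i in range(1))       # a single trivial task
print('spawn + orchestration only: {}s'.format(time.time() - start_t))

start_t = time.time()
Parallel(n_jobs=2)(delayed(noop)(i) for i in range(10000))   # 10k trivial tasks
print('10k trivial tasks: {}s'.format(time.time() - start_t))
# the difference between the two, divided by 10000, estimates
# the per-task add-on cost in [s] on this platform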
If you are interested in benchmarking these costs, i.e. how long it takes in [us] (how much you have to pay before any useful work even starts) to do 1), 2) or 3), benchmarking templates have been posted to test and validate these principal costs on one's own platform, so as to be able to decide what minimum work-package can justify these unavoidable expenses and yield a "positive" speedup greater than (best, a lot greater than, >>) 1.0000 when compared to the pure-[SERIAL] original.
Source: https://stackoverflow.com/questions/57706763/why-does-joblib-parallel-take-much-more-time-than-a-non-paralleled-computation