I\'m running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear sys
The reason for this behavior could be that Accelerate uses multithreading, while the others don't.
Most BLAS implementations follow the environment variable OMP_NUM_THREADS
to determine how many threads to use. I believe they only use 1 thread if not told otherwise explicitly.
Accelerate's man page, however sounds like threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS
.
To determine if this is really what's happening, try
export VECLIB_MAXIMUM_THREADS=1
before calling the Accelerate version, and
export OMP_NUM_THREADS=4
for the other versions.
Independent of whether this is really the reason, it's a good idea to always set these variables when you use BLAS to be sure you control what is going on.