I'm running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear systems (i.e., a call to numpy.linalg.solve()). I came up with this small benchmark:
import numpy as np
import time
# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)
t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)
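As an aside, time.time() can be coarse and the first BLAS call may pay one-off initialization costs (thread-pool startup, library loading). A slightly more careful version of the benchmark, sketched here with a warm-up run and time.perf_counter (and a smaller matrix size so it runs quickly), would be:

```python
import numpy as np
import time

n = 1000  # smaller than 5000 so the warm-up stays cheap
a = np.random.randn(n, n)
b = np.random.randn(n, n)

# Warm-up: the first call may include one-off thread-pool / library setup.
np.linalg.solve(a, b)

t1 = time.perf_counter()
x = np.linalg.solve(a, b)
elapsed = time.perf_counter() - t1
print(f"solve({n}x{n}) took {elapsed:.3f} s")
```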
I've been running this on:
- My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2 GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
- An Amazon EC2 c3.xlarge instance with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"
Bottom line:
- On the Mac it runs in ~4.5 seconds
- On the EC2 instance it runs in ~19.5 seconds
I have also tried it on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware configuration).
Can anyone explain why the performance on the Mac (with the Accelerate Framework) is more than 4x better? More details about the NumPy / BLAS setup on each machine are provided below.
Laptop setup
numpy.show_config() gives me:
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
EC2 instance setup
On Ubuntu 14.04, I installed OpenBLAS with
sudo apt-get install libopenblas-base libopenblas-dev
When installing NumPy, I created a site.cfg with the following contents:
[default]
library_dirs= /usr/lib/openblas-base
[atlas]
atlas_libs = openblas
numpy.show_config() gives me:
atlas_threads_info:
libraries = ['lapack', 'openblas']
library_dirs = ['/usr/lib']
define_macros = [('ATLAS_INFO', '"\\"None\\""')]
language = f77
include_dirs = ['/usr/include/atlas']
blas_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
openblas_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'openblas']
library_dirs = ['/usr/lib']
define_macros = [('ATLAS_INFO', '"\\"None\\""')]
language = f77
include_dirs = ['/usr/include/atlas']
openblas_lapack_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
The reason for this behavior could be that Accelerate uses multithreading by default, while the others don't.
Most BLAS implementations follow the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they use only one thread unless explicitly told otherwise.
Accelerate's man page, however, sounds like threading is turned on by default; it can be limited by setting the environment variable VECLIB_MAXIMUM_THREADS (e.g., to 1 to effectively disable threading).
To determine if this is really what's happening, try
export VECLIB_MAXIMUM_THREADS=1
before calling the Accelerate version, and
export OMP_NUM_THREADS=4
for the other versions.
Independent of whether this is really the reason, it's a good idea to always set these variables when you use BLAS, so you can be sure you control what is going on.
Source: https://stackoverflow.com/questions/26511430/performance-of-numpy-with-different-blas-implementations