On my two computers, I tried this code:
N <- 10e3
M <- 2000
X <- matrix(rnorm(N * M), N)
system.time(crossprod(X))
The first one i
R is distributed with a default BLAS implementation, but it may not be optimized for your computer. Installing an optimized BLAS such as ATLAS or OpenBLAS before building R, as advised in the R Installation guide, is the way to go. If you click on Download R for Linux and then on debian/, it says:
You may want to install the automatically tuned Atlas or the multi-threaded OpenBlas library in order to get higher performance for linear algebra operations
The source of R can be downloaded here, and the BLAS implementation is located in R-3.5.0/src/extra/blas. For instance, the Fortran source code of the matrix-matrix multiplication routine dgemm is located in blas.f, along with most other BLAS routines (in a single file!). The comments of the function specify:
-- Written on 8-February-1989.
Jack Dongarra, Argonne National Laboratory.
Iain Duff, AERE Harwell.
Jeremy Du Croz, Numerical Algorithms Group Ltd.
Sven Hammarling, Numerical Algorithms Group Ltd.
The very same lines can be found in the netlib implementation of the routine dgemm.
In contrast, OpenBLAS provides different implementations, one for each kind of processor. See for instance this file devoted to dgemm for the Haswell microarchitecture. There are calls to prefetcht0 for prefetching and calls to vfmadd231pd, a vectorized FMA SIMD instruction that performs the double-precision operation d = a*b + c four times at once.
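These routines are what R's own matrix products end up calling: crossprod(X) computes t(X) %*% X through a single BLAS call instead of materializing the transpose first, which is why it is the operation benchmarked above. A minimal sanity check (works with any BLAS):

```r
# crossprod(X) and t(X) %*% X are mathematically identical,
# but crossprod() avoids forming the transpose explicitly
# and dispatches to a single BLAS routine.
X <- matrix(rnorm(100 * 5), 100)
a <- crossprod(X)
b <- t(X) %*% X
all.equal(a, b)  # TRUE up to floating-point rounding
```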
Using an optimized BLAS can save the day. See for instance this benchmark, where netlib's dgemm() takes 64 seconds while the MKL, OpenBLAS, or ATLAS dgemm takes less than 4 seconds.
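Before worrying about any of this, it is worth checking which BLAS/LAPACK your R is actually linked against. Recent R versions report this directly through sessionInfo(), and La_version() reports the LAPACK version (a sketch; on older R or statically linked builds these fields may be empty):

```r
# Recent R versions report the BLAS/LAPACK libraries in use.
si <- sessionInfo()
si$BLAS        # path of the BLAS shared library ("" or NULL on some builds)
si$LAPACK      # path of the LAPACK shared library
La_version()   # LAPACK version string
```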
The case of R's internal BLAS might be even worse than the classic Netlib library. Indeed, as specified in appendix A.3.1.5 Shared BLAS of the R Installation and Administration manual:
R offers the option of compiling the BLAS into a dynamic library libRblas stored in R_HOME/lib and linking both R itself and all the add-on packages against that library. .... There may be performance disadvantages in using a shared BLAS. ... However, experiments showed that in many cases using a shared BLAS was as fast, provided high levels of compiler optimization are used.
Looking at the config.site file of R, the optimization level is '-O2' for g77/gfortran. Therefore, tuning the FFLAGS option could prove useful if the Fortran compiler is not g77/gfortran. During the configure step, there should be a line like checking whether we are using the GNU Fortran 77 compiler... yes (line 7521 of the configure file).
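For example, when building R from source you can pass custom Fortran flags at configure time (a sketch; the exact flags that help depend on your compiler and CPU):

```shell
# Hypothetical source build with a higher optimization level for the
# bundled Fortran BLAS sources; adjust the flags for your compiler.
./configure FFLAGS="-O3 -mtune=native"
make
```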
tl;dr: CentOS uses a single-threaded OpenBLAS, while Linux Mint uses the reference BLAS by default but can easily be switched to other BLAS versions.
The R packages for CentOS available from EPEL depend on openblas-Rblas. This seems to be an OpenBLAS build providing BLAS for R. So while it looks like R's internal BLAS is used, it is actually OpenBLAS. The LAPACK version is always the one provided by R.
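One way to verify this on CentOS is to inspect what R's libRblas.so really is (a sketch, assuming a shared-BLAS build of R; paths vary by distribution):

```shell
# Locate R's home directory and check what libRblas.so links against.
# On EPEL's CentOS packages this should reveal OpenBLAS.
ldd "$(R RHOME)/lib/libRblas.so"
```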
On Debian and derived distributions like Mint, r-base-core depends on shared BLAS and LAPACK libraries. By default these are provided by the reference implementations libblas3 and liblapack3. These are not particularly fast, but you can replace them easily by installing packages like libopenblas-base. You control which BLAS and LAPACK are used on your system via update-alternatives.
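Switching the system-wide BLAS on Debian/Mint looks roughly like this (a sketch; the exact alternative names depend on the distribution and architecture, e.g. they may carry an -x86_64-linux-gnu suffix):

```shell
# Install OpenBLAS and select it as the system BLAS/LAPACK.
sudo apt-get install libopenblas-base
# List the installed alternatives and pick one interactively.
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3
```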
For controlling the number of threads with OpenBLAS I normally use RhpcBLASctl
:
N <- 20000
M <- 2000
X <- matrix(rnorm(N * M), N)
RhpcBLASctl::blas_set_num_threads(2)
system.time(crossprod(X))
#>    user  system elapsed
#> 2.492 0.331 1.339
RhpcBLASctl::blas_set_num_threads(1)
system.time(crossprod(X))
#>    user  system elapsed
#> 2.319 0.052 2.316
For some reason, setting the environment variables OPENBLAS_NUM_THREADS, GOTO_NUM_THREADS, or OMP_NUM_THREADS from within R does not have the desired effect. On CentOS even RhpcBLASctl does not help, since the OpenBLAS used there is single-threaded.
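A likely explanation is that OpenBLAS reads these variables when the library is loaded, which happens before an R session can change them with Sys.setenv(). Setting them in the shell before launching R should work (a sketch, assuming an OpenBLAS-linked R with Rscript on the PATH):

```shell
# The thread count must already be in the environment when R starts,
# because OpenBLAS reads OPENBLAS_NUM_THREADS at load time.
OPENBLAS_NUM_THREADS=1 Rscript -e 'system.time(crossprod(matrix(rnorm(2e7), 1e4)))'
```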