I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are multiplied almost instantly. Why is MATLAB so fast at matrix multiplication?
Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:
>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.
MATLAB uses highly optimized libraries for matrix multiplication, which is why plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.
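For completeness, here's roughly what that warm-up and timing pattern looks like as a self-contained script. This is just a sketch (timings obviously depend on your hardware); it also gathers the GPU result back to the host to check that it matches the CPU result:

% Sketch: warm up, time, and verify the benchmark above.
A  = rand(1024);
gA = gpuArray(A);
for k = 1:3                        % warm up both paths first
    C  = A  * A;
    gC = gA * gA;
end
wait(gpuDevice);                   % make sure queued GPU work has finished
tic, C = A * A; tCPU = toc;
tic, gC = gA * gA; wait(gpuDevice); tGPU = toc;
err = norm(gather(gC) - C, 'fro') / norm(C, 'fro');   % relative CPU/GPU difference
fprintf('CPU: %.4f s, GPU: %.4f s, relative error: %g\n', tCPU, tGPU, err);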
Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:
>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
0.0324
>> gputimeit(@()gA*gA)
ans =
0.0022
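To put those numbers in context, it can be useful to convert the measured times into an effective throughput figure: an n-by-n matrix product costs roughly 2*n^3 flops. A minimal sketch using the same timeit/gputimeit pattern:

% Rough throughput estimate from the timeit/gputimeit measurements.
n  = 1024;
A  = rand(n);
gA = gpuArray(A);
tCPU = timeit(@() A * A);          % timeit handles warm-up and averaging
tGPU = gputimeit(@() gA * gA);     % gputimeit also synchronises with the GPU
flops = 2 * n^3;                   % approximate flop count for an n-by-n product
fprintf('CPU: %.1f GFLOPS, GPU: %.1f GFLOPS\n', flops/tCPU/1e9, flops/tGPU/1e9);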
Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:
>> timeit(@()A*A)
ans =
0.0229
>> gputimeit(@()gA*gA)
ans =
4.8019e-04
(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS; MAGMA is still used for some gpuArray operations though.)
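If you want to see how the gap behaves at the 2048x2048 size from the question and beyond, the same pattern extends to a simple size sweep (again just a sketch, double precision throughout):

% Sketch: compare CPU and GPU matrix multiplication over a range of sizes.
for n = [512 1024 2048 4096]
    A  = rand(n);
    gA = gpuArray(A);
    tCPU = timeit(@() A * A);
    tGPU = gputimeit(@() gA * gA);
    fprintf('n = %4d: CPU %.4f s, GPU %.4f s, speed-up %.1fx\n', ...
            n, tCPU, tGPU, tCPU/tGPU);
end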