Why is MATLAB so fast in matrix multiplication?

無奈伤痛 2020-11-22 00:29

I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x

12 answers
  •  爱一瞬间的悲伤
    2020-11-22 00:51

    Because MATLAB was originally developed for numerical linear algebra (matrix manipulation) and ships with libraries written specifically for matrix multiplication. MATLAB can now also use GPUs (graphics processing units) for this.

    And if we look at your computation results:

                 1024x1024   2048x2048   4096x4096
                 ---------   ---------   ---------
    CUDA C (ms)      43.11      391.05     3407.99
    C++ (ms)       6137.10    64369.29   551390.93
    C# (ms)       10509.00   300684.00  2527250.00
    Java (ms)      9149.90    92562.28   838357.94
    MATLAB (ms)      75.01      423.10     3133.90
    

    then we can see that MATLAB is not the only fast option: CUDA C (NVIDIA's programming language for its GPUs) beats MATLAB at 1024x1024 and 2048x2048, although MATLAB is slightly ahead at 4096x4096. CUDA C also has libraries developed specifically for matrix multiplication, and it runs on the GPU.
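    To make the comparison concrete, here is a small C++ sketch that turns the table into ratios. The struct and function names (`Timing`, `ratio`, `print_ratios`) are made up for this illustration; only the numbers come from the table. Classical matrix multiplication does about 2n^3 floating-point operations, so doubling n should cost roughly 8x if everything else scaled perfectly.

```cpp
#include <cassert>
#include <cstdio>

// Timings in milliseconds for one implementation, copied from the table above.
struct Timing {
    const char* name;
    double ms_1024;
    double ms_2048;
    double ms_4096;
};

// The five rows of the table, in the same order.
constexpr Timing kTimings[] = {
    {"CUDA C", 43.11, 391.05, 3407.99},
    {"C++", 6137.10, 64369.29, 551390.93},
    {"C#", 10509.00, 300684.00, 2527250.00},
    {"Java", 9149.90, 92562.28, 838357.94},
    {"MATLAB", 75.01, 423.10, 3133.90},
};

// How many times larger slow_ms is than fast_ms.
double ratio(double slow_ms, double fast_ms) {
    assert(fast_ms > 0.0);
    return slow_ms / fast_ms;
}

// Print, per implementation, the growth each time n doubles
// and the gap to MATLAB at 4096x4096.
void print_ratios() {
    const Timing& matlab = kTimings[4];
    for (const Timing& t : kTimings) {
        std::printf("%-7s 1024->2048: x%5.1f  2048->4096: x%5.1f  vs MATLAB at 4096: x%6.1f\n",
                    t.name,
                    ratio(t.ms_2048, t.ms_1024),
                    ratio(t.ms_4096, t.ms_2048),
                    ratio(t.ms_4096, matlab.ms_4096));
    }
}
```

    Calling `print_ratios()` from `main` prints one line per implementation. The plain C++, C# and Java versions are two orders of magnitude slower than MATLAB at 4096x4096, and the C# timing grows by far more than 8x between 1024 and 2048, which suggests memory access rather than arithmetic is what limits the naive loops.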

    Short history of MATLAB

    Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for numerical linear algebra) and EISPACK (a software library for the numerical computation of eigenvalues and eigenvectors) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (a standard software library for numerical linear algebra).

    Source

    What is CUDA C

    CUDA C also has libraries developed specifically for matrix multiplication, such as cuBLAS. It runs on the GPU and can interoperate with graphics APIs such as OpenGL (Open Graphics Library) and Direct3D (on MS Windows).

    The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.

    Example of CUDA processing flow:

    1. Copy data from main memory to GPU memory
    2. CPU initiates the GPU compute kernel
    3. GPU's CUDA cores execute the kernel in parallel
    4. Copy the resulting data from GPU memory to main memory

    Comparing CPU and GPU Execution Speeds

    We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU.

    For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.

    Source

    From the introduction to the CUDA C Programming Guide:

    Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

    Figure 1. Floating-Point Operations per Second for the CPU and GPU

    Figure 2. Memory Bandwidth for the CPU and GPU

    The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

    Figure 3. The GPU Devotes More Transistors to Data Processing

    More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
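    As a rough worked example of arithmetic intensity (an idealized count that assumes each matrix element is moved between memory and the processor only once), multiplying two n x n matrices takes about 2n^3 floating-point operations while touching 3n^2 elements (two inputs and one output). Adding two vectors of length n takes n operations on 3n elements:

```latex
\[
I_{\text{matmul}} = \frac{2n^{3}\ \text{flops}}{3n^{2}\ \text{elements}} = \frac{2n}{3},
\qquad
I_{\text{matmul}}\Big|_{n=2048} \approx 1365\ \text{flops per element},
\]
\[
I_{\text{vector add}} = \frac{n\ \text{flops}}{3n\ \text{elements}} = \frac{1}{3}\ \text{flops per element}.
\]
```

    Under these assumptions the intensity of matrix multiplication grows with n, which is why large matrix products are a good fit for the hardware described in the quoted passage, while vector addition stays limited by memory bandwidth at any size.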

    Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

    Source

    Advanced reading

    • GPUs (Graphics processing unit)
    • MATLAB
    • CUDA C Programming Guide
    • Using GPUs in MATLAB
    • Basic Linear Algebra Subprograms (BLAS)

    • Anatomy of High-Performance Matrix Multiplication, by Kazushige Goto and Robert A. van de Geijn


    Some interesting facts

    I've written C++ matrix multiplication that is as fast as Matlab's but it took some care. (Before Matlab was using GPUs for this).

    Quoted from this answer.
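    The quote does not say what "some care" involved, and the sketch below is not that author's code or MATLAB's implementation. It only illustrates one common ingredient, cache blocking (tiling), next to the textbook triple loop. The names `Matrix`, `multiply_naive` and `multiply_blocked` and the default tile size of 64 are assumptions made for this example; tuned BLAS libraries add further steps such as packing, SIMD kernels and multithreading.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Square n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<double>;

// Textbook triple loop, C = A * B. The inner loop reads B down a column,
// i.e. with stride n, which is unfriendly to the cache for large n.
Matrix multiply_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    return c;
}

// Cache-blocked variant: the same arithmetic, done tile by tile so that the
// tiles of A, B and C being worked on can stay in cache. Within a tile the
// innermost loop walks B and C along rows (unit stride).
Matrix multiply_blocked(const Matrix& a, const Matrix& b, std::size_t n,
                        std::size_t block = 64) {
    assert(a.size() == n * n && b.size() == n * n && block > 0);
    Matrix c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += block) {
        const std::size_t i_end = std::min(ii + block, n);
        for (std::size_t kk = 0; kk < n; kk += block) {
            const std::size_t k_end = std::min(kk + block, n);
            for (std::size_t jj = 0; jj < n; jj += block) {
                const std::size_t j_end = std::min(jj + block, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

    Both functions compute the same product; how much faster the blocked version runs depends on the tile size, the cache sizes of the machine and the compiler flags, so it is worth timing both on the target hardware before drawing conclusions.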
