Why is matrix multiplication faster with numpy than with ctypes in Python?


I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:

  • Pure python implementation: no surprises here.
  • Numpy
  • ctypes: calling a compiled C implementation from Python
6 Answers
  • 2020-11-29 00:46

    The language used to implement a certain functionality is a bad measure of performance by itself. Often, using a more suitable algorithm is the deciding factor.

    In your case, you're using the naive approach to matrix multiplication as taught in school, which is O(n^3). However, you can do much better for certain kinds of matrices, e.g. square matrices, sparse matrices and so on.
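
    For reference, the schoolbook algorithm is just three nested loops over the output and the shared dimension. A minimal sketch in C (assuming n×n matrices stored row-major in flat arrays; this is the baseline being discussed, not numpy's code):

    /* Naive O(n^3) "schoolbook" matrix multiplication. */
    void matmul_naive(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }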

    Have a look at the Coppersmith–Winograd algorithm (square matrix multiplication in O(n^2.3737)) for a good starting point on fast matrix multiplication. Also see the section "References", which lists some pointers to even faster methods.


    For a more down-to-earth example of an astonishing performance gain, try writing a fast strlen() and compare it to the glibc implementation. If you don't manage to beat it, read glibc's strlen() source; it has fairly good comments.
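
    A naive version, for comparison, looks roughly like this (a sketch; glibc instead reads a word at a time and uses bit tricks to spot the terminating zero byte):

    #include <stddef.h>

    /* One byte per iteration -- hard to beat glibc's word-at-a-time
     * implementation with a loop this simple. */
    size_t my_strlen(const char *s)
    {
        const char *p = s;
        while (*p) {
            p++;
        }
        return (size_t)(p - s);
    }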

  • 2020-11-29 00:47

    I'm not too familiar with Numpy, but the source is on Github. Part of the dot product is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming gets translated into a specific C implementation for each datatype. For example:

    /**begin repeat
     *
     * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
     * LONG, ULONG, LONGLONG, ULONGLONG,
     * FLOAT, DOUBLE, LONGDOUBLE,
     * DATETIME, TIMEDELTA#
     * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
     * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
     * npy_float, npy_double, npy_longdouble,
     * npy_datetime, npy_timedelta#
     * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
     * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
     * npy_float, npy_double, npy_longdouble,
     * npy_datetime, npy_timedelta#
     */
    static void
    @name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
               void *NPY_UNUSED(ignore))
    {
        @out@ tmp = (@out@)0;
        npy_intp i;
    
        for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
            tmp += (@out@)(*((@type@ *)ip1)) *
                   (@out@)(*((@type@ *)ip2));
        }
        *((@type@ *)op) = (@type@) tmp;
    }
    /**end repeat**/
    

    This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.

    One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
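
    To make the stride idea concrete, here is a sketch (not numpy's actual code) of a strided dot product plus a hypothetical caller. For c[i][j] of a row-major product, one input walks a row of a with stride sizeof(double) and the other walks a column of b with stride n * sizeof(double); both strides are computed once, outside the inner loop.

    #include <stddef.h>

    /* Strides are in bytes and are computed once by the caller,
     * like is1/is2 in the numpy kernel above. */
    static double strided_dot(const char *ip1, ptrdiff_t is1,
                              const char *ip2, ptrdiff_t is2, ptrdiff_t n)
    {
        double tmp = 0.0;
        for (ptrdiff_t i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
            tmp += *(const double *)ip1 * *(const double *)ip2;
        }
        return tmp;
    }

    /* Hypothetical caller: c[i][j] = row i of a  .  column j of b */
    /* double cij = strided_dot((const char *)(a + i * n), sizeof(double),
     *                          (const char *)(b + j),     n * sizeof(double),
     *                          n);
     */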

    Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to iterate over each contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in opposite major orders). But it can at least do that for the result elements.
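
    To illustrate the kind of trick meant here, one common reordering (a sketch, not necessarily what numpy does) is to loop i, k, j instead of i, j, k, so the innermost loop streams over contiguous rows of both b and c:

    /* i-k-j loop order: the inner loop walks b and c along rows, which are
     * contiguous in row-major storage. c must be zeroed beforehand. */
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* loaded once, reused for the row */
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }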

    Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.

    Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

    for (i = mp1; i <= *n; i += 5) {
        stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) +
                SX(i + 2) * SY(i + 2) + SX(i + 3) * SY(i + 3) +
                SX(i + 4) * SY(i + 4);
    }
    

    This unrolling factor was likely picked after profiling several candidates. But one theoretical advantage is that more arithmetic operations are done between each branch point, giving the compiler and CPU more choice about how to schedule them and get as much instruction pipelining as possible.
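
    Stripped of the f2c boilerplate, the overall pattern is roughly this (a sketch, not the literal Netlib code): a short cleanup loop for the first n % 5 elements, then the 5-way unrolled main loop.

    float sdot_unrolled(const float *sx, const float *sy, int n)
    {
        float stemp = 0.0f;
        int m = n % 5;
        int i;
        for (i = 0; i < m; i++) {          /* cleanup: leftover elements */
            stemp += sx[i] * sy[i];
        }
        for (i = m; i < n; i += 5) {       /* main loop: 5 elements per pass */
            stemp += sx[i]     * sy[i]
                   + sx[i + 1] * sy[i + 1]
                   + sx[i + 2] * sy[i + 2]
                   + sx[i + 3] * sy[i + 3]
                   + sx[i + 4] * sy[i + 4];
        }
        return stemp;
    }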

  • 2020-11-29 00:48

    Numpy is also highly optimized code. There is an essay about parts of it in the book Beautiful Code.

    The ctypes version has to go through a dynamic translation from C to Python and back on every call, which adds overhead. In Numpy, most matrix operations are done entirely inside the library.

  • 2020-11-29 00:53

    NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
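
    For context, calling GEMM through the CBLAS interface from C looks roughly like this (a sketch; link against a BLAS such as ATLAS or OpenBLAS). This is the kind of routine the work gets handed to, rather than a hand-rolled triple loop:

    #include <cblas.h>

    /* C = 1.0 * A * B + 0.0 * C, for n x n row-major double matrices. */
    void matmul_blas(int n, const double *a, const double *b, double *c)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, a, n,
                         b, n,
                    0.0, c, n);
    }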

    The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).

    In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.

  • 2020-11-29 01:00

    The most common reason given for Fortran's speed advantage in numerical code, afaik, is that the language makes it easier to detect aliasing - the compiler can tell that the matrices being multiplied don't share the same memory, which can help improve caching (no need to be sure results are written back immediately into "shared" memory). This is why C99 introduced restrict.
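
    For reference, C99's restrict lets you make the same no-aliasing promise explicitly. A minimal sketch (a hypothetical helper, not numpy code):

    /* restrict tells the compiler that out, a and b never alias, so it can
     * vectorize the loop without worrying that a store to out[i] might
     * change a later a[j] or b[j]. */
    void vec_add(int n, double *restrict out,
                 const double *restrict a, const double *restrict b)
    {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }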

    However, in this case, I wonder if also the numpy code is managing to use some special instructions that the C code is not (as the difference seems particularly large).

  • 2020-11-29 01:04

    The guys who wrote NumPy obviously know what they're doing.

    There are many ways to optimize matrix multiplication. For example, the order in which you traverse the matrices affects the memory access pattern, which affects performance.
    Good use of SSE is another way to optimize, and NumPy probably employs it (see the sketch below).
    There may be more ways, which the developers of NumPy know and I don't.
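
    As a sketch of what "good use of SSE" means (illustrative only; numpy's actual kernels and any BLAS it uses are more involved), here is a dot product that processes four floats per iteration with SSE intrinsics:

    #include <xmmintrin.h>

    /* Assumes n is a multiple of 4; a cleanup loop would handle the rest. */
    float dot_sse(const float *x, const float *y, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);   /* load 4 floats from x */
            __m128 vy = _mm_loadu_ps(y + i);   /* load 4 floats from y */
            acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);               /* spill lanes, then sum them */
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }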

    By the way, did you compile your C code with optimization enabled?

    You can try the following optimization for C. It computes two output elements per pass of the inner loop, which exposes more instruction-level parallelism, and I suppose NumPy does something along those lines.
    NOTE: this only works for even sizes. With extra work, you can remove this limitation and keep the performance improvement.

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j += 2) {
            int sub1 = 0, sub2 = 0;
            for (k = 0; k < n; k++) {
                sub1 += a[i * n + k] * b[k * n + j];
                sub2 += a[i * n + k] * b[k * n + j + 1];
            }
            c[i * n + j]     = sub1;
            c[i * n + j + 1] = sub2;
        }
    }
    