Matrix multiplication via std::vector is 10 times slower than numpy

Posted by 心已入冬 on 2020-04-14 08:43:08

Question


Although it is well known that using nested std::vector to represent matrices is a bad idea, let's use it for now, since it is flexible and many existing functions can handle std::vector.

I thought that for small matrices the speed difference could be ignored, but it turned out that vector<vector<double>> is more than 10 times slower than numpy.dot().

Let A and B be matrices of size size x size. We assume square matrices just for simplicity; we don't intend to limit the discussion to the square case. We initialize each matrix in a deterministic way and finally calculate C = A * B.

We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.

Python3 code

import numpy as np
import time
import sys

if len(sys.argv) != 2:
    print("Pass `size` as an argument.", file=sys.stderr)
    sys.exit(1)
size = int(sys.argv[1])

# np.ndarray allocates without initializing; every element is set below.
A = np.ndarray((size, size))
B = np.ndarray((size, size))

for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B)
print("{:.3e}".format(time.time() - start), file=sys.stderr)

C++ code

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>  // for atoi

// `using namespace std;` belongs after the includes; placing it before them
// relies on the compiler predeclaring namespace std and is not portable.
using namespace std;

int main(int argc, char **argv) {

    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));

    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    // steady_clock is monotonic and the appropriate clock for measuring intervals.
    auto start = chrono::steady_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (unsigned i = 0; i < size; ++i) {
        for (unsigned j = 0; j < size; ++j) {
            for (unsigned k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::steady_clock::now() - start).count() << "\n";

}

C++ code (multithreaded)

We also wrote a multithreaded version of the C++ code, since numpy.dot() automatically runs in parallel.
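For reference, here is a minimal sketch of how the row-partitioning in such a multithreaded version can look. This is an illustration, not the exact GitHub code; multiply_parallel is our name for the hypothetical driver, and f is a per-range worker with the signature used in the functions quoted later in this post.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>
using namespace std;

// Worker with the same signature as the f() functions quoted below:
// computes rows [row_start, row_end) of C = A * B.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end);

// Hypothetical driver: split the rows of C into near-equal chunks and
// run f() on each chunk in its own thread.
void multiply_parallel(const vector<vector<double>> &A,
                       const vector<vector<double>> &B,
                       vector<vector<double>> &C, unsigned num_threads) {
    const unsigned rows = C.size();
    const unsigned chunk = (rows + num_threads - 1) / num_threads;
    vector<thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const unsigned row_start = t * chunk;
        const unsigned row_end = min(rows, row_start + chunk);
        if (row_start >= row_end) break;
        threads.emplace_back(f, cref(A), cref(B), ref(C), row_start, row_end);
    }
    for (auto &th : threads) th.join();  // wait for all chunks to finish
}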

You can get all the code from GitHub.

Result

The C++ version is more than 10 times slower than the Python 3 (numpy) version.

matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
         C++ (1 thread): 8.66e-03
        C++ (2 threads): 4.68e-03
        C++ (3 threads): 3.14e-03
        C++ (4 threads): 2.43e-03
               Python 3: 4.07e-04
-----------------------------------------------

matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
         C++ (1 thread): 6.985e-02
        C++ (2 threads): 3.647e-02
        C++ (3 threads): 2.462e-02
        C++ (4 threads): 1.915e-02
               Python 3: 1.466e-03
-----------------------------------------------

Question

Is there any way to make the C++ implementation faster?


Optimizations I Tried

  1. Swap the loop order -> at most 3.5 times faster (than the original C++ code, not than the numpy code). A sketch is shown after this list.

  2. Optimization 1 plus partial unrolling -> at most 4.5 times faster. I originally claimed this could be done only when size is known in advance, but that is wrong: as pointed out in this comment, size does not need to be known. We can simply cap the loop variable of the unrolled loop at a multiple of the unroll factor and process the remaining elements with a normal loop. See my implementation for an example.

  3. Optimization 2, plus reducing calls to C[i][j] by accumulating into a plain local variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is non-negligibly slow.

  4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3, of course.)

  5. Optimization 3, plus reducing calls to operator[] by introducing a pointer into a row of A, since A's elements are accessed sequentially in the unrolled loop. -> At most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.

  6. The g++ -funroll-loops flag to unroll for loops -> no change

  7. g++'s #pragma GCC unroll n -> no change

  8. The g++ -flto flag to turn on link-time optimization -> no change

  9. A blocked (tiled) algorithm -> no change

  10. Transposing B to avoid cache misses -> no change

  11. A single flat std::vector instead of nested std::vector<std::vector<double>>, plus swapped loop order, blocking, and partial unrolling -> at most 2.2 times faster. A sketch of the flat layout is also shown after this list.

  12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster

  13. Optimization 3, plus PGO -> same as Optimization 3

  14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3
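To make optimizations 1 and 11 concrete, here are two minimal sketches. They are illustrations under our own naming (f_swapped, f_flat), not the exact GitHub implementations, and they omit the unrolling and blocking described above.

#include <vector>
using namespace std;

// Sketch of optimization 1: swap the loop order from i-j-k to i-k-j so the
// innermost loop walks the rows B[k] and C[i] sequentially (cache-friendly)
// instead of striding down a column of B, which touches a different
// heap-allocated row on every step.
void f_swapped(const vector<vector<double>> &A, const vector<vector<double>> &B,
               vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned k = 0; k < k_max; ++k) {
            const double a = A[i][k];  // constant across the whole j loop
            for (unsigned j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
}

And the flat layout of optimization 11, where element (i, j) of an n x n matrix lives at index i * n + j of one contiguous buffer, so each matrix is a single allocation with no per-row indirection:

// Sketch of optimization 11: row-major matrices in a single std::vector<double>,
// combined with the swapped (i-k-j) loop order from optimization 1.
void f_flat(const vector<double> &A, const vector<double> &B, vector<double> &C,
            unsigned n, unsigned row_start, unsigned row_end) {
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (unsigned j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}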


Current Status

(originally) 13.06 times slower -> (currently) 2.10 times slower

Again, you can get all the code on GitHub, but let us quote some of it here; all of the following are functions called from the multithreaded version of the C++ code.

Original Code (GitHub)

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    // Naive i-j-k order: the innermost loop strides down a column of B,
    // hitting a different heap-allocated row on every step.
    for (unsigned i = row_start; i < row_end; ++i) {
        for (unsigned j = 0; j < j_max; ++j) {
            for (unsigned k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

Current Best Code (GitHub)

This is the implementation of Optimization 5 above.

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {

    static const unsigned num_unroll = 5;

    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();

    for (unsigned i = row_start; i < row_end; ++i) {
        // Unrolled part: process num_unroll values of k per iteration,
        // reading A[i][k..k+4] through a raw pointer to avoid repeated operator[].
        for (unsigned k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (unsigned j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        // Remainder loop: handle the leftover k values when size is not a
        // multiple of num_unroll.
        for (unsigned k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (unsigned j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }

}

We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and have finally run out of ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's would do better, since the matrices we handle are not large, and each operation on std::vector is so expensive that, as we've seen, merely reducing calls to operator[] improved performance considerably.

We (want to) believe we can make it better, though.


Answer 1:


Matrix multiplication is relatively easy to optimize. However, if you want to get decent CPU utilization, it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:

  1. Use SIMD instructions (see the sketch after this list)
  2. Use register blocking and fetch multiple pieces of data at once
  3. Optimize for your cache lines (mainly the L2 and L3 caches)
  4. Parallelize your code to use multiple threads
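To illustrate step 1 (this is our sketch, not the answerer's code): with a flat row-major layout, the inner j loop of the i-k-j order maps directly onto AVX2+FMA intrinsics. Compile with g++ -O3 -mavx2 -mfma; for brevity the sketch assumes n is a multiple of 4.

#include <immintrin.h>

// C += A * B for flat row-major n x n double matrices, processing four
// doubles of a row of C per iteration with fused multiply-add.
void matmul_avx2(const double *A, const double *B, double *C, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        for (unsigned k = 0; k < n; ++k) {
            const __m256d a = _mm256_set1_pd(A[i * n + k]);  // broadcast A[i][k]
            for (unsigned j = 0; j < n; j += 4) {            // 4 doubles per step
                __m256d c = _mm256_loadu_pd(&C[i * n + j]);
                const __m256d b = _mm256_loadu_pd(&B[k * n + j]);
                c = _mm256_fmadd_pd(a, b, c);                // c += a * b
                _mm256_storeu_pd(&C[i * n + j], c);
            }
        }
    }
}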

This link is a very good resource that explains all the nasty details: https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

If you want more in-depth advice, leave a comment.



Source: https://stackoverflow.com/questions/60973065/matrix-multiplication-via-stdvector-is-10-times-slower-than-numpy
