Why is CUDA matmul with shared memory slower than the naive version?


I am trying to understand why the optimized version takes 30% more time than the naive one:

__global__
void kernel_matmul(DeviceMatrixD::In a, DeviceMatrixD::In b, De


                      