Why is CUDA matmul slower with shared memory than the naive version?

太阳男子 2021-02-19 08:36

I am trying to understand why the optimized (shared-memory) version takes 30% more time than the naive one:

__global__
void kernel_matmul(DeviceMatrixD::In a, DeviceMatrixD::In b, De         


        