I am trying to understand why the optimized version takes 30% more time than the normal one:
__global__ void kernel_matmul(DeviceMatrixD::In a, DeviceMatrixD::In b, De