Optimizing execution of a CUDA kernel for Triangular Matrix calculation

前端 未结 1 1900
鱼传尺愫
鱼传尺愫 2021-02-09 13:37

I am developing my first Cuda application, and I have a kernel with \"below-expected throughput\", which seems to be the biggest bottleneck at the moment.

The task of th

相关标签:
1条回答
  • 2021-02-09 14:31

    I typically disregard these types of optimization questions because they are on the verge of off-topic, in my opinion. Worst still, you provide no MCVE so anyone trying to answer would have to write all their own support code to compile and benchmark your kernel. And this sort of work does require benchmarking and code analysis. But because your problem is basically a linear algebra problem (and I like linear algebra), I answered it rather than close voting it as too broad......

    With that off my chest. there are a couple of things which immediately jump out in the code which could be improved and which would probably have a material affect on the run time.

    The first is that the trip count of the inner loop is known a priori. Anytime you have a situation like that, let the compiler know. Loop unrolling and code reordering is a very powerful compiler optimization and the NVIDIA compiler is extremely good at it. If you move D into a template parameter, you can do something like this:

    template<int D>
    __device__ float esum(const float *x, const float *y)
    {
        float val = 0.f;
    #pragma unroll
        for(int i=0; i<D; i++) { 
            float diff = x[i] - y[i];
            val += diff * diff;
        }
        return val;
    }
    
    template<int D>
    __global__ 
    void vdistance0(const float * __restrict__ Y, float * __restrict__ DD, const int N)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < N * N; i += stride) {
            const int m = i / N;
            const int n = i % N;
            DD[n + N * m] = esum<D>(Y + D * n, Y + D * m);
        }
    }
    
    template __global__ void vdistance0<2>(const float *, float *, const int);
    template __global__ void vdistance0<3>(const float *, float *, const int);
    

    The compiler will inline esum and unroll the inner loop and it can then use its reordering heuristics to better interleave loads and flops to improve throughput. The resulting code has a lower register footprint too. When I run this for N=10000 and D=2, I get about 35% speed up (7.1ms versus 4.5ms on a GTX 970 with CUDA 9.1).

    But there is an even more glaringly obvious optimization than this. The calculation you are performing will produce a symmetric output matrix. You only need to do (N*N)/2 operations to compute the full matrix, rather than the N*N you are doing in your code [technically N(N/2 -1) because the diagonal entries are zero, but lets forget the diagonal for the purposes of this discussion].

    So taking a different approach and using one block to calculate each row of the upper triangular output matrix, then you can do something like this:

    struct udiag
    {
        float *p;
        int m;
    
        __device__ __host__ udiag(float *_p, int _m) : p(_p), m(_m) {};
        __device__ __host__ float* get_row(int i) { return p + (i * (i + 1)) / 2; };
    };
    
    
    template<int D>
    __global__ 
    void vdistance2(const float * __restrict__ Y, float * __restrict__ DD, const int N)
    {
         int rowid = blockIdx.x;
         int colid = threadIdx.x;
         udiag m(DD, N);
    
         for(; rowid < N; rowid += gridDim.x) {
             float* p = m.get_row(rowid);
             const float* y = Y + D * rowid;
             for(int i=colid; i < (N-rowid); i += blockDim.x) {
                 p[i] = esum<D>(y, y + D * i);
             }
        }
    }
    template __global__ void vdistance2<2>(const float *, float *, const int);
    template __global__ void vdistance2<3>(const float *, float *, const int);
    

    This uses a little helper class to encapsulate the triangle numbers needed for the addressing scheme for the upper triangular output matrix. Doing this saves an enormous amount of memory and memory bandwidth as well as reducing the total FLOP count for the calculation. If you need to do other things afterwards BLAS (and CUBLAS) supports computations on upper or lower triangular matrices. Use them. When I run this I get about 75% speedup (7.1ms versus 1.6ms on the same GTX 970).

    Huge disclaimer: All the code you see here was written during a 45 minute lunch break and as been very lightly tested. I make absolutely no claims that anything in this answer is actually correct. I have confirmed that it compiles and doesn't produce a runtime error when I run it to get profiling data. That is it. Cavaet Emptor and all that.

    0 讨论(0)
提交回复
热议问题