Can this OpenCL code be optimized?

南旧 2021-01-18 10:45

I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v and two DxD matrices A and B, compute result[y] = factor * sum_k( v[k] * A[y][k] * B[y][k] ).
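
A minimal sketch of the kind of naive kernel being described (inferred from the optimized listings in the answer below, not quoted from the question):

    // Reconstructed naive baseline (an assumption based on the answer's code):
    // every work item re-reads vector[k] from global memory for its row.
    #define D 1000
    __kernel void element_mult(
        __global float *result,
        __global const float *vector,
        __global const float *matrix,
        __global const float *matrix2,
        const float factor)
    {
        int y = get_global_id(0);   // one work item per output row
        float sum = 0;
        for (int k = 0; k < D; k++) {
            sum += vector[k] * matrix[(y * D) + k] * matrix2[(y * D) + k];
        }
        result[y] = sum * factor;
    }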

1 Answer
  • 2021-01-18 11:24

    Optimization #1: make vector __local.

    My first pass at this got a decent improvement in performance. Each vector[k] is read a total of D times, so I copied the vector into a __local buffer; this is only possible because D is small enough to fit in local memory. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 GPUs, so even these slower GPUs are still waiting on memory access.

        #define D 1000
        __kernel void element_mult(
            __global float *result,
            __global const float *vector,
            __global const float *matrix,
            __global const float *matrix2,
            const float factor)
        {
            int y = get_global_id(0);
            float sum = 0;

            // Cooperative copy of the shared vector into local memory.
            // Striding from lid keeps the copy in bounds even when D is
            // not a multiple of the local size.
            __local float vectCopy[D];
            int ls = get_local_size(0);
            int lid = get_local_id(0);
            for (int i = lid; i < D; i += ls) {
                vectCopy[i] = vector[i];
            }
            // barrier, not mem_fence: every work item must see the whole copy.
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int k = 0; k < D; k++) {
                sum += vectCopy[k] * matrix[(y * D) + k] * matrix2[(y * D) + k];
            }
            result[y] = sum * factor;
        }
    

    With this change, APP Profiler shows a new ALU:fetch ratio of 0.20 for the 5870 and 6970 GPUs. Average times dropped from 1513 to 1034, and from 1261 to 861, on the same cards. The low-end GPUs are now bound by ALU instead of fetch (greater than a 4:1 ratio).

    Optimization #2: calculate each result[y] using an entire work group.

    You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by using the whole work group to compute a single element of the result at a time. I defined ls (the local size) as 64 here because it works on my hardware, as well as most vendors'. The work-group size you use from the host side will have to be 64 unless you change that definition. It needs to be a #define so the sum[ls] storage can be declared __local, and I don't like passing variable-sized __local buffers into my kernels (the alternative is sketched just below).
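
    For reference, the variable-sized alternative being avoided here would declare the buffer as a kernel parameter (`__local float *sum`) and have the host size it per launch — a sketch of that approach, not what the listing below does:

        // Hypothetical host-side setup for a __local kernel argument:
        // arg index 5 would be the added `__local float *sum` parameter.
        // Local arguments are sized in bytes and passed with a NULL pointer.
        size_t wg = 64;   /* assumed work-group size */
        clSetKernelArg(kernel, 5, wg * sizeof(cl_float), NULL);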

    Results: 5870 ALU:fetch = 0.59:1, avg = 708; 6970 ALU:fetch = 0.72:1, avg = 590. According to APP Profiler, this is about twice as fast as your original listing.

        #define D 1000
        #define ls 64
        __kernel void element_mult(
            __global float *result,
            __global const float *vector,
            __global const float *matrix,
            __global const float *matrix2,
            const float factor)
        {
            // Cooperative, bounds-checked copy of the shared vector.
            __local float vectCopy[D];
            int lid = get_local_id(0);
            for (int i = lid; i < D; i += ls) {
                vectCopy[i] = vector[i];
            }
            barrier(CLK_LOCAL_MEM_FENCE);

            int ng = get_num_groups(0);
            int gid = get_group_id(0);
            __local float sum[ls];

            // Rows are strided across work groups; each group computes
            // one result[y] at a time using all ls work items.
            for (int y = gid; y < D; y += ng) {
                // Each work item accumulates a private partial sum over a
                // strided slice of row y, then publishes it for reduction.
                float partial = 0;
                for (int k = lid; k < D; k += ls) {
                    partial += vectCopy[k] * matrix[(y * D) + k] * matrix2[(y * D) + k];
                }
                sum[lid] = partial;
                barrier(CLK_LOCAL_MEM_FENCE);

                if (lid == 0) {
                    // Serial reduction of the ls partial sums.
                    float total = sum[0];
                    for (int k = 1; k < ls; k++) {
                        total += sum[k];
                    }
                    result[y] = total * factor;
                }
                // Don't let the next row overwrite sum[] before the reduction.
                barrier(CLK_LOCAL_MEM_FENCE);
            }
        }
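
    To match the #define, the host-side launch would look something like this (a sketch; `queue` and `kernel` are placeholder names, and any multiple of 64 works for the global size since each group strides over multiple rows):

        // Hypothetical host-side enqueue: the local size must be 64 to match
        // the ls definition baked into the kernel above.
        size_t local_size  = 64;
        size_t global_size = 16 * local_size;   /* any multiple of 64 */
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                            &global_size, &local_size,
                                            0, NULL, NULL);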
    

    EDIT: APP Profiler = AMD APP KernelAnalyzer
