Can this OpenCL code be optimized?

南旧 2021-01-18 10:45

I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and

1 answer
  •  臣服心动
    2021-01-18 11:24

    Optimization #1: make vector __local.

    My first pass gave a decent performance improvement. Each vector[k] is read D times in total, so I copied the vector into a __local array. This only works because D is small enough for the vector to fit in local memory (1000 floats, about 4 KB). The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 GPUs. Even the slower GPUs are still waiting on memory access.

        #define D 1000
        __kernel void element_mult(
        __global float *result,
        __global const float *vector,
        __global const float *matrix,
        __global const float *matrix2,
        const float factor)
        {
            int y = get_global_id(0);
            float sum = 0;
    
            __local float vectCopy[D];
            int ls = get_local_size(0);
            int lid = get_local_id(0);
            for(int i=0;i

    With this change, APP profiler shows a new ALU:fetch ratio of 0.20 for the 5870 and 6970 GPUs. Average times dropped from 1513 to 1034 and from 1261 to 861 on the same cards, respectively. The low-end GPUs are now bound by ALU instead of fetch (a ratio greater than 4:1).

    Optimization #2: calculate each result[y] using an entire work group.

    You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by having the whole work group compute a single element of the result at a time. I defined ls (local size) as 64 here because it works on my hardware, as well as on most vendors'. The work-group size you pass from the host side must be 64 unless you change that definition. It has to be a compile-time constant so that the sum[ls] storage can be declared as __local, and I don't like passing variable-sized __local vars into my kernels.

    Results: 5870: ALU:fetch = 0.59:1, avg = 708. 6970: ALU:fetch = 0.72:1, avg = 590. According to APP profiler, this is about twice as fast as your original listing (1513 to 708 and 1261 to 590, roughly 2.1x on both cards).

    #define D 1000
    #define ls 64
    __kernel void element_mult(
    __global float *result,
    __global const float *vector,
    __global const float *matrix,
    __global const float *matrix2,
    const float factor)
    {
        __local float vectCopy[D];
        int lid = get_local_id(0);
        for(int i=0;i

    EDIT: APP profiler = AMD APP KernelAnalyzer
