My OpenCL test does not run much faster than on the CPU

南笙 2021-01-13 13:10

I am trying to measure the execution time on the GPU and compare it with the CPU. I wrote a simple_add kernel that adds all elements of a short int vector. The kernel code is:

<
2 answers
  • 2021-01-13 13:33

    I did some extra tests and realized that the GPU is optimized for floating-point operations. I changed the test code as below:

    void kernel simple_add(global const int *A, global const uint *B, global int *C)
    {
        // Each 32-bit element is treated as two 16-bit halves.
        size_t i = get_global_id(0);
        int AA = A[i];
        int BB = B[i];
        // Split into high and low halves and convert them to float.
        float AH = 0xFFFF0000 & AA;
        float AL = 0x0000FFFF & AA;
        float BH = 0xFFFF0000 & BB;
        float BL = 0x0000FFFF & BB;
        // Floating-point work on each half, masked back into its 16 bits.
        int CL = (int)(AL * cos(AL) + BL * sin(BL)) & 0x0000FFFF;
        int CH = (int)(AH * cos(AH) + BH * sin(BL)) & 0xFFFF0000;
        C[i] = CH | CL;
    }
    

    and got the result I expected, with the GPU roughly 13 times faster than the CPU:

                    CPU time:      741046.665  micro-sec
                    GPU time:       54618.889  micro-sec
                    ----------------------------------------------------
                    CPU time:      741788.112  micro-sec
                    GPU time:       54875.666  micro-sec
                    ----------------------------------------------------
                    CPU time:      739975.979  micro-sec
                    GPU time:       54560.445  micro-sec
                    ----------------------------------------------------
                    CPU time:      755848.937  micro-sec
                    GPU time:       54582.111  micro-sec
                    ----------------------------------------------------
                    CPU time:      724100.716  micro-sec
                    GPU time:       56893.445  micro-sec
                    ----------------------------------------------------
                    CPU time:      744476.351  micro-sec
                    GPU time:       54596.778  micro-sec
                    ----------------------------------------------------
                    CPU time:      727787.538  micro-sec
                    GPU time:       54602.445  micro-sec
                    ----------------------------------------------------
                    CPU time:      731132.939  micro-sec
                    GPU time:       54710.000  micro-sec
                    ----------------------------------------------------
                    CPU time:      727899.150  micro-sec
                    GPU time:       54583.444  micro-sec
                    ----------------------------------------------------
                    CPU time:      727089.880  micro-sec
                    GPU time:       54594.778  micro-sec
                    ----------------------------------------------------
    

    I then tried somewhat heavier floating-point work, as in the kernel below:

    void kernel simple_add(global const int *A, global const uint *B, global int *C)
    {
        // Each 32-bit element is treated as two 16-bit halves.
        size_t i = get_global_id(0);
        int AA = A[i];
        int BB = B[i];
        float AH = 0xFFFF0000 & AA;
        float AL = 0x0000FFFF & AA;
        float BH = 0xFFFF0000 & BB;
        float BL = 0x0000FFFF & BB;
        int CL = (int)(AL * (cos(AL) + sin(2*AL) + cos(3*AL) + sin(4*AL) + cos(5*AL) + sin(6*AL)) +
                       BL * (cos(BL) + sin(2*BL) + cos(3*BL) + sin(4*BL) + cos(5*BL) + sin(6*BL))) & 0x0000FFFF;
        int CH = (int)(AH * (cos(AH) + sin(2*AH) + cos(3*AH) + sin(4*AH) + cos(5*AH) + sin(6*AH)) +
                       BH * (cos(BH) + sin(2*BH) + cos(3*BH) + sin(4*BH) + cos(5*BH) + sin(6*BH))) & 0xFFFF0000;
        C[i] = CH | CL;
    }
    

    The speedup was similar, at roughly 10x:

                    CPU time:     3905725.933  micro-sec
                    GPU time:      354543.111  micro-sec
                    -----------------------------------------
                    CPU time:     3698211.308  micro-sec
                    GPU time:      354850.333  micro-sec
                    -----------------------------------------
                    CPU time:     3696179.243  micro-sec
                    GPU time:      354302.667  micro-sec
                    -----------------------------------------
                    CPU time:     3692988.914  micro-sec
                    GPU time:      354764.111  micro-sec
                    -----------------------------------------
                    CPU time:     3699645.146  micro-sec
                    GPU time:      354287.666  micro-sec
                    -----------------------------------------
                    CPU time:     3681591.964  micro-sec
                    GPU time:      357071.889  micro-sec
                    -----------------------------------------
                    CPU time:     3744179.707  micro-sec
                    GPU time:      354249.444  micro-sec
                    -----------------------------------------
                    CPU time:     3704143.214  micro-sec
                    GPU time:      354934.111  micro-sec
                    -----------------------------------------
                    CPU time:     3667518.628  micro-sec
                    GPU time:      354809.222  micro-sec
                    -----------------------------------------
                    CPU time:     3714312.759  micro-sec
                    GPU time:      354883.888  micro-sec
                    -----------------------------------------
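To put a number on the two listings above, the speedup per run is simply CPU time divided by GPU time. The following is a minimal sketch in plain C, assuming only the first three runs of each kernel are used; the function and array names (`speedup`, `mean_speedup`, `light_*`, `heavy_*`) are made up for illustration and are not part of the original test program.

```c
#include <assert.h>

/* First three runs of each kernel, copied from the listings above
 * (all times in microseconds). The array names are illustrative. */
const double light_cpu_us[3] = {741046.665, 741788.112, 739975.979};
const double light_gpu_us[3] = { 54618.889,  54875.666,  54560.445};
const double heavy_cpu_us[3] = {3905725.933, 3698211.308, 3696179.243};
const double heavy_gpu_us[3] = { 354543.111,  354850.333,  354302.667};

/* Speedup of the GPU over the CPU for one timed run. */
double speedup(double cpu_us, double gpu_us)
{
    assert(gpu_us > 0.0);
    return cpu_us / gpu_us;
}

/* Mean of the per-run speedups over n paired measurements. */
double mean_speedup(const double *cpu_us, const double *gpu_us, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += speedup(cpu_us[i], gpu_us[i]);
    return sum / n;
}
```

For these runs, `mean_speedup(light_cpu_us, light_gpu_us, 3)` comes out at about 13.5 and `mean_speedup(heavy_cpu_us, heavy_gpu_us, 3)` at about 10.6.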
    
  • 2021-01-13 13:37

    The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with 1/4 of the total number of threads (which is NumberOfAllElements/16). This would also let each work item load from memory faster.

    Also, the kernel does little computation compared to its memory operations, so mapping the buffers to RAM should perform better. Don't copy the arrays; map them to memory using the map/unmap enqueue commands.

    If it's still not faster, you can use both the GPU and the CPU at the same time, one working on the first half and the other on the second half, to finish in 50% of the time.

    Also, don't put clFinish inside the loop. Put it just after the end of the loop. That way the commands are enqueued much faster, and since the queue executes in order (I suppose it is an in-order queue), it won't start later items before finishing the first one. Calling clFinish after each enqueue is extra overhead; a single clFinish after the last kernel is enough.


    ATI RV730: 64 VLIW units, each with at least 4 streaming cores, at 750 MHz.

    i3-2100: 2 cores (the extra threads only help fill pipeline bubbles), each with AVX capable of streaming 8 operations simultaneously, so it can have 16 operations in flight, at more than 3 GHz.

    Simply multiplying the number of streaming operations by the clock frequency (in GHz):

    ATI RV730 = 64 x 4 x 0.75 = 192 units (more with multiply-add functions, using the 5th element of each VLIW)

    i3-2100 = 16 x 3 = 48 units

    So the GPU should be at least 4x as fast (use int4, uint4). This holds for simple ALU and FPU operations such as bitwise operations and multiplications. The performance of special functions such as transcendentals could be different, since they run only on the 5th unit in each VLIW.
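The estimate above can be written out as a small calculation. This is only a sketch of the same back-of-the-envelope arithmetic (parallel lanes times clock frequency), not a performance model; the helper names `peak_gops`, `rv730_gops` and `i3_2100_gops` are hypothetical, and the 3.0 GHz figure stands in for the "more than 3 GHz" quoted in the answer.

```c
#include <assert.h>

/* Rough peak throughput in billions of simple operations per second:
 * number of units x parallel lanes per unit x clock frequency in GHz. */
double peak_gops(int units, int lanes_per_unit, double clock_ghz)
{
    assert(units > 0 && lanes_per_unit > 0 && clock_ghz > 0.0);
    return units * lanes_per_unit * clock_ghz;
}

/* ATI RV730: 64 VLIW units x 4 streaming cores at 750 MHz. */
double rv730_gops(void)
{
    return peak_gops(64, 4, 0.75);
}

/* i3-2100: 2 cores x 8 AVX lanes, assuming 3.0 GHz. */
double i3_2100_gops(void)
{
    return peak_gops(2, 8, 3.0);
}
```

With these inputs `rv730_gops()` gives 192 and `i3_2100_gops()` gives 48, a ratio of 4. The fifth VLIW element and transcendental throughput are deliberately left out, since the answer treats them separately.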
