I am trying to measure the execution time on the GPU and compare it with the CPU. I wrote a simple_add function to add all elements of a short int vector. After some extra tests I realized that the GPU is optimized for floating-point operations, so I changed the test code as below. The kernel code is:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
    /// Split each element into its high and low 16 bits,
    /// process each half, then recombine.
    int AA = A[get_global_id(0)];
    int BB = B[get_global_id(0)];
    float AH = 0xFFFF0000 & AA;
    float AL = 0x0000FFFF & AA;
    float BH = 0xFFFF0000 & BB;
    float BL = 0x0000FFFF & BB;
    int CL = (int)(AL*cos(AL) + BL*sin(BL)) & 0x0000FFFF;
    int CH = (int)(AH*cos(AH) + BH*sin(BH)) & 0xFFFF0000;
    C[get_global_id(0)] = CH | CL;
}
and got the result that I expected (about 10 times faster):
CPU time: 741046.665 micro-sec
GPU time: 54618.889 micro-sec
----------------------------------------------------
CPU time: 741788.112 micro-sec
GPU time: 54875.666 micro-sec
----------------------------------------------------
CPU time: 739975.979 micro-sec
GPU time: 54560.445 micro-sec
----------------------------------------------------
CPU time: 755848.937 micro-sec
GPU time: 54582.111 micro-sec
----------------------------------------------------
CPU time: 724100.716 micro-sec
GPU time: 56893.445 micro-sec
----------------------------------------------------
CPU time: 744476.351 micro-sec
GPU time: 54596.778 micro-sec
----------------------------------------------------
CPU time: 727787.538 micro-sec
GPU time: 54602.445 micro-sec
----------------------------------------------------
CPU time: 731132.939 micro-sec
GPU time: 54710.000 micro-sec
----------------------------------------------------
CPU time: 727899.150 micro-sec
GPU time: 54583.444 micro-sec
----------------------------------------------------
CPU time: 727089.880 micro-sec
GPU time: 54594.778 micro-sec
----------------------------------------------------
For somewhat heavier floating-point operations like the one below:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
    /// Split each element into its high and low 16 bits,
    /// process each half, then recombine.
    int AA = A[get_global_id(0)];
    int BB = B[get_global_id(0)];
    float AH = 0xFFFF0000 & AA;
    float AL = 0x0000FFFF & AA;
    float BH = 0xFFFF0000 & BB;
    float BL = 0x0000FFFF & BB;
    int CL = (int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL)) +
                   BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL))) & 0x0000FFFF;
    int CH = (int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH)) +
                   BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH))) & 0xFFFF0000;
    C[get_global_id(0)] = CH | CL;
}
The speed ratio was more or less the same (again about 10 times faster):
CPU time: 3905725.933 micro-sec
GPU time: 354543.111 micro-sec
-----------------------------------------
CPU time: 3698211.308 micro-sec
GPU time: 354850.333 micro-sec
-----------------------------------------
CPU time: 3696179.243 micro-sec
GPU time: 354302.667 micro-sec
-----------------------------------------
CPU time: 3692988.914 micro-sec
GPU time: 354764.111 micro-sec
-----------------------------------------
CPU time: 3699645.146 micro-sec
GPU time: 354287.666 micro-sec
-----------------------------------------
CPU time: 3681591.964 micro-sec
GPU time: 357071.889 micro-sec
-----------------------------------------
CPU time: 3744179.707 micro-sec
GPU time: 354249.444 micro-sec
-----------------------------------------
CPU time: 3704143.214 micro-sec
GPU time: 354934.111 micro-sec
-----------------------------------------
CPU time: 3667518.628 micro-sec
GPU time: 354809.222 micro-sec
-----------------------------------------
CPU time: 3714312.759 micro-sec
GPU time: 354883.888 micro-sec
-----------------------------------------
The ATI RV730 has a VLIW architecture, so it is better to try the uint4 and int4 vector types with 1/4 of the total number of threads (which is NumberOfAllElements/16). This also lets each work item load from memory faster.
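As a rough sketch of what that vectorization could look like (untested, assuming the same packing scheme as the question's first kernel — OpenCL's `convert_float4`/`convert_int4` builtins replace the scalar casts):

```
void kernel simple_add4(global const int4 * A, global const uint4 * B, global int4 * C)
{
    /* one work item now processes 4 packed ints, so launch with
       1/4 of the original global size */
    int4 AA = A[get_global_id(0)];
    int4 BB = convert_int4(B[get_global_id(0)]);

    float4 AH = convert_float4(AA & (int4)0xFFFF0000);
    float4 AL = convert_float4(AA & (int4)0x0000FFFF);
    float4 BH = convert_float4(BB & (int4)0xFFFF0000);
    float4 BL = convert_float4(BB & (int4)0x0000FFFF);

    int4 CL = convert_int4(AL*cos(AL) + BL*sin(BL)) & (int4)0x0000FFFF;
    int4 CH = convert_int4(AH*cos(AH) + BH*sin(BH)) & (int4)0xFFFF0000;

    C[get_global_id(0)] = CH | CL;
}
```

The math builtins (cos, sin) are overloaded for vector types in OpenCL C, so the body stays almost identical while each instruction does four lanes of work.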
Also, the kernel doesn't do much computation compared to its memory operations, so buffers mapped to host RAM would give better performance. Don't copy the arrays; map them to memory using the map/unmap enqueue commands.
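A minimal host-side sketch of that map/unmap pattern (error handling omitted; it assumes a `ctx` context, a `queue` command queue, and an element count `n` already exist):

```c
/* sketch only: 'ctx', 'queue' and 'n' are assumed to exist */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                            n * sizeof(cl_int), NULL, &err);

/* map: gives the host a pointer into the buffer instead of copying */
cl_int *p = (cl_int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, n * sizeof(cl_int),
                                         0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = (cl_int)i;            /* fill the input in place */

/* unmap: hands the region back to the device before the kernel runs */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
```

CL_MEM_ALLOC_HOST_PTR asks the runtime to allocate the buffer in host-visible memory, which is what makes the map cheap compared to a clEnqueueWriteBuffer copy.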
If it's still not fast enough, you can use both the GPU and the CPU at the same time, each working on one half of the data, to finish in 50% of the time.
Also, don't put clFinish inside the loop; put it just after the end of the loop. That way commands are enqueued much faster, and since the queue is (I assume) in-order, the device won't start one item before finishing the previous one anyway. Adding clFinish after each enqueue is extra overhead; a single clFinish after the last kernel is enough.
ATI RV730: 64 VLIW units, each with at least 4 streaming cores, at 750 MHz.
i3-2100: 2 cores (the extra hardware threads mainly help fill pipeline bubbles), each with AVX capable of streaming 8 operations simultaneously, so 16 operations can be in flight, at more than 3 GHz.
Simply multiplying the number of streaming units by the clock frequency:
ATI RV730 = 192 units (more with multiply-add functions, using the 5th element of each VLIW)
i3-2100 = 48 units
So the GPU should be at least 4x as fast (use int4, uint4). This holds for simple ALU and FPU operations such as bitwise operations and multiplications. Performance of special functions such as transcendentals can be different, since they run only on the 5th unit of each VLIW.