I am trying to measure GPU execution time and compare it with the CPU. I wrote a simple_add kernel that adds the elements of two short int vectors.
I did some extra tests and realized that the GPU is optimized for floating-point operations, so I changed the test code as below:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
    ///------------------------------------------------
    /// Split each element into its high and low 16-bit
    /// halves and process the two halves separately.
    int AA = A[get_global_id(0)];
    int BB = B[get_global_id(0)];
    float AH = 0xFFFF0000 & AA;   // high 16 bits (not shifted down)
    float AL = 0x0000FFFF & AA;   // low 16 bits
    float BH = 0xFFFF0000 & BB;
    float BL = 0x0000FFFF & BB;
    int CL = (int)(AL*cos(AL) + BL*sin(BL)) & 0x0000FFFF;
    int CH = (int)(AH*cos(AH) + BH*sin(BL)) & 0xFFFF0000;
    C[get_global_id(0)] = CH | CL;
}
and got the result I expected (the GPU was roughly 13 times faster):
CPU time: 741046.665 micro-sec
GPU time: 54618.889 micro-sec
----------------------------------------------------
CPU time: 741788.112 micro-sec
GPU time: 54875.666 micro-sec
----------------------------------------------------
CPU time: 739975.979 micro-sec
GPU time: 54560.445 micro-sec
----------------------------------------------------
CPU time: 755848.937 micro-sec
GPU time: 54582.111 micro-sec
----------------------------------------------------
CPU time: 724100.716 micro-sec
GPU time: 56893.445 micro-sec
----------------------------------------------------
CPU time: 744476.351 micro-sec
GPU time: 54596.778 micro-sec
----------------------------------------------------
CPU time: 727787.538 micro-sec
GPU time: 54602.445 micro-sec
----------------------------------------------------
CPU time: 731132.939 micro-sec
GPU time: 54710.000 micro-sec
----------------------------------------------------
CPU time: 727899.150 micro-sec
GPU time: 54583.444 micro-sec
----------------------------------------------------
CPU time: 727089.880 micro-sec
GPU time: 54594.778 micro-sec
----------------------------------------------------
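As a side note on how the GPU time can be measured: besides a host timer around the enqueue/finish pair, OpenCL can report the kernel's own start and end timestamps through profiling events. Below is a minimal sketch of both measurements, assuming the cl.hpp C++ wrapper and the floating-point kernel above; the problem size and input data are placeholders, not the values used for the timings in this post.

#include <CL/cl.hpp>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // First GPU of the first platform (error handling omitted for brevity).
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices);

    // Profiling must be enabled on the queue for event timestamps to be valid.
    cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);

    // Kernel source: the floating-point simple_add kernel shown above.
    std::string src = R"CLC(
        void kernel simple_add(global const int* A, global const uint* B, global int* C) {
            int AA = A[get_global_id(0)];
            int BB = B[get_global_id(0)];
            float AH = 0xFFFF0000 & AA;
            float AL = 0x0000FFFF & AA;
            float BH = 0xFFFF0000 & BB;
            float BL = 0x0000FFFF & BB;
            int CL = (int)(AL*cos(AL) + BL*sin(BL)) & 0x0000FFFF;
            int CH = (int)(AH*cos(AH) + BH*sin(BL)) & 0xFFFF0000;
            C[get_global_id(0)] = CH | CL;
        })CLC";
    cl::Program::Sources sources;
    sources.push_back({src.c_str(), src.length()});
    cl::Program program(context, sources);
    program.build({devices[0]});
    cl::Kernel kernel(program, "simple_add");

    const size_t N = 1 << 24;                 // placeholder problem size
    std::vector<int>          A(N, 0x00020001), C(N);
    std::vector<unsigned int> B(N, 0x00040003);
    cl::Buffer bufA(context, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, N * sizeof(int), A.data());
    cl::Buffer bufB(context, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, N * sizeof(unsigned int), B.data());
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, N * sizeof(int));
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);

    // Host-timer view: wall-clock time around enqueue + finish.
    cl::Event ev;
    auto h0 = std::chrono::high_resolution_clock::now();
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange, NULL, &ev);
    queue.finish();
    auto h1 = std::chrono::high_resolution_clock::now();

    // Device view: the event's start/end timestamps, reported in nanoseconds.
    cl_ulong t0 = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong t1 = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();

    std::cout << "GPU time (host timer): "
              << std::chrono::duration<double, std::micro>(h1 - h0).count() << " micro-sec\n";
    std::cout << "GPU time (event)     : " << (t1 - t0) / 1000.0 << " micro-sec\n";

    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, N * sizeof(int), C.data());
    return 0;
}

The event-based number excludes queueing and driver overhead, so it is usually a bit smaller than the host-timer number.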
For somewhat heavier floating-point operations, like the kernel below:
void kernel simple_add(global const int * A, global const uint * B, global int* C)
{
    ///------------------------------------------------
    /// Same 16-bit split as above, but with a heavier
    /// chain of sin/cos evaluations per half.
    int AA = A[get_global_id(0)];
    int BB = B[get_global_id(0)];
    float AH = 0xFFFF0000 & AA;   // high 16 bits (not shifted down)
    float AL = 0x0000FFFF & AA;   // low 16 bits
    float BH = 0xFFFF0000 & BB;
    float BL = 0x0000FFFF & BB;
    int CL = (int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL)) +
                   BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL))) & 0x0000FFFF;
    int CH = (int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH)) +
                   BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH))) & 0xFFFF0000;
    C[get_global_id(0)] = CH | CL;
}
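For reference, the CPU numbers come from running the equivalent computation in a plain host loop. Below is a minimal sketch of such a reference loop for this heavier kernel; it is an assumed illustration of the host-side code, not necessarily the exact loop that produced the CPU timings below.

#include <cmath>
#include <cstdint>
#include <vector>

// CPU mirror of the heavier simple_add kernel above: split each 32-bit
// element into 16-bit halves, apply the same sin/cos chain, and repack.
// Note: as in the kernel, the float-to-int casts can overflow for large
// intermediate values, so results are only meaningful for comparable inputs.
void simple_add_cpu(const std::vector<int32_t>& A,
                    const std::vector<uint32_t>& B,
                    std::vector<int32_t>& C)
{
    for (std::size_t i = 0; i < C.size(); ++i) {
        int32_t AA = A[i];
        int32_t BB = (int32_t)B[i];
        float AH = (float)(0xFFFF0000 & AA);   // high 16 bits, not shifted down
        float AL = (float)(0x0000FFFF & AA);   // low 16 bits
        float BH = (float)(0xFFFF0000 & BB);
        float BL = (float)(0x0000FFFF & BB);
        int32_t CL = (int32_t)(AL*(std::cos(AL)+std::sin(2*AL)+std::cos(3*AL)+std::sin(4*AL)+std::cos(5*AL)+std::sin(6*AL)) +
                               BL*(std::cos(BL)+std::sin(2*BL)+std::cos(3*BL)+std::sin(4*BL)+std::cos(5*BL)+std::sin(6*BL))) & 0x0000FFFF;
        int32_t CH = (int32_t)(AH*(std::cos(AH)+std::sin(2*AH)+std::cos(3*AH)+std::sin(4*AH)+std::cos(5*AH)+std::sin(6*AH)) +
                               BH*(std::cos(BH)+std::sin(2*BH)+std::cos(3*BH)+std::sin(4*BH)+std::cos(5*BH)+std::sin(6*BH))) & 0xFFFF0000;
        C[i] = CH | CL;
    }
}

Timing this loop with std::chrono, as in the earlier sketch, gives the CPU side of the comparison.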
The ratio was more or less the same (the GPU was still roughly 10 times faster):
CPU time: 3905725.933 micro-sec
GPU time: 354543.111 micro-sec
-----------------------------------------
CPU time: 3698211.308 micro-sec
GPU time: 354850.333 micro-sec
-----------------------------------------
CPU time: 3696179.243 micro-sec
GPU time: 354302.667 micro-sec
-----------------------------------------
CPU time: 3692988.914 micro-sec
GPU time: 354764.111 micro-sec
-----------------------------------------
CPU time: 3699645.146 micro-sec
GPU time: 354287.666 micro-sec
-----------------------------------------
CPU time: 3681591.964 micro-sec
GPU time: 357071.889 micro-sec
-----------------------------------------
CPU time: 3744179.707 micro-sec
GPU time: 354249.444 micro-sec
-----------------------------------------
CPU time: 3704143.214 micro-sec
GPU time: 354934.111 micro-sec
-----------------------------------------
CPU time: 3667518.628 micro-sec
GPU time: 354809.222 micro-sec
-----------------------------------------
CPU time: 3714312.759 micro-sec
GPU time: 354883.888 micro-sec
-----------------------------------------