I have been learning CUDA for a while and I have run into the following problem.
Here is what I am doing:
Copy GPU
int* B;
// ...
As for your second question:
B[ind(tid,1,Nel)] = j; // in most cases j does not go all the way up to Nel
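When a calculation runs on the GPU, threads are scheduled in groups. A thread that has finished its job does no further work until all the threads in the same group have finished. In CUDA terms this group is the warp, whose threads execute in lockstep, and the block does not release its resources until every one of its threads is done. This corresponds to what OpenCL calls a workgroup.
In other words, the time you need for this calculation is that of the worst case. It does not matter that most of the threads stop early, as long as at least one thread in the group goes all the way.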
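To make the worst-case argument concrete, here is a minimal sketch that times one warp twice: once with every thread doing the full loop, and once with only a single thread doing it. The kernel `variableWork`, the helper `timeVariableWork`, and the sizes are made up for illustration and are not taken from your code. If the reasoning above holds, the two timings should come out close to each other, but I have not measured this on your hardware.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread tid runs limit[tid] loop iterations.
__global__ void variableWork(const int* limit, unsigned int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    unsigned int acc = tid;
    for (int j = 0; j < limit[tid]; ++j) {
        acc = acc * 1664525u + 1013904223u;  // cheap arithmetic to keep the loop alive
    }
    out[tid] = acc;
}

// Launches one block with limits.size() threads and returns the kernel time
// in milliseconds, measured with CUDA events. Error checks omitted for brevity.
float timeVariableWork(const std::vector<int>& limits)
{
    int n = static_cast<int>(limits.size());
    int* dLimit = nullptr;
    unsigned int* dOut = nullptr;
    cudaMalloc((void**)&dLimit, n * sizeof(int));
    cudaMalloc((void**)&dOut, n * sizeof(unsigned int));
    cudaMemcpy(dLimit, limits.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time start-up costs are not measured.
    variableWork<<<1, n>>>(dLimit, dOut, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    variableWork<<<1, n>>>(dLimit, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dLimit);
    cudaFree(dOut);
    return ms;
}

int main()
{
    const int n = 32;             // one warp
    const int maxIter = 1 << 22;  // arbitrary "worst case" iteration count

    std::vector<int> allLong(n, maxIter);  // every thread goes all the way
    std::vector<int> oneLong(n, 1);        // almost every thread stops early...
    oneLong[0] = maxIter;                  // ...except thread 0

    printf("all threads long: %.3f ms\n", timeVariableWork(allLong));
    printf("one thread long:  %.3f ms\n", timeVariableWork(oneLong));
    return 0;
}
```

If the early exits are spread evenly over the threads, sorting or grouping the work so that threads with similar iteration counts land in the same warp is one common way to win some of that time back.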
I am not sure about your first question. How do you measure the time? I am not too familiar with CUDA, but I think that when you copy from the CPU to the GPU the implementation buffers your data, which hides the actual time the transfer takes.
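One way to take host-side buffering out of the picture is to time the copy with CUDA events, which are recorded on the device timeline. The sketch below assumes a plain `int` buffer named `B` as in your snippet. The helper `timeHostToDeviceCopy` and the buffer size are hypothetical. As far as I understand the runtime documentation, a `cudaMemcpy` from pageable host memory may return once the data has been staged, so a CPU timer around the call can under-report the transfer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns the time in milliseconds for one host-to-device copy of `count`
// ints, measured with CUDA events. Returns -1.0f if allocation or copy fails.
float timeHostToDeviceCopy(size_t count)
{
    std::vector<int> host(count, 1);  // pageable host memory
    int* B = nullptr;
    if (cudaMalloc((void**)&B, count * sizeof(int)) != cudaSuccess) {
        return -1.0f;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(B, host.data(), count * sizeof(int),
                                 cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has actually occurred

    float ms = -1.0f;
    if (err == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(B);
    return ms;
}

int main()
{
    cudaFree(0);  // force context creation so it is not part of the measurement

    const size_t count = 1 << 24;  // arbitrary size for illustration
    float ms = timeHostToDeviceCopy(count);
    if (ms < 0.0f) {
        printf("copy failed\n");
        return 1;
    }
    printf("host-to-device copy of %zu ints: %.3f ms\n", count, ms);
    return 0;
}
```

Comparing this number with what your current timer reports should show whether the measurement or the copy itself is the issue.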