The difference between running time and the time to obtain results in CUDA
Question

I am trying to implement my algorithm on a GPU using CUDA. The program works well, but there is a problem: when I try to print out the results, they take a long time to appear. Here is some of my code. Assume that the correctness of the results does not matter.

    __device__ unsigned char dev_state[128];

    __device__ unsigned char GMul(unsigned char a, unsigned char b) { // Galois Field (256) multiplication of two bytes
        unsigned char p = 0;
        int counter;
        unsigned char hi_bit_set;
        for (counter = 0; counter < 8; counter++) {
            if (