Asynchronous executions of CUDA memory copies and cuFFT
Question: I have a CUDA program that computes FFTs of, say, size 50000. Currently, I copy the whole array to the GPU and then execute the cuFFT. Now I am trying to optimize the program, and the NVIDIA Visual Profiler tells me to hide the memcpy by overlapping it with parallel computation. My question is: is it possible, for example, to copy the first 5000 elements, start calculating, and then copy the next batch of data in parallel with the calculations, and so on? Since a DFT is basically a sum over the