I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is larger than the amount of memory available on the device.
Using host memory would be orders of magnitude slower than using on-device memory. It has both very high latency and very limited throughput. For example, the capacity of PCIe x16 is a mere 8 GB/s, while the bandwidth of device memory on a GTX 460 is 108 GB/s.
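For reference, here is a minimal sketch of what mapped pinned (zero-copy) memory looks like, since that is what `cudaHostAlloc` enables. The kernel name `scale`, the array size, and the error handling (omitted) are just placeholders; the point is that every access from the kernel travels over PCIe, so this path is bandwidth-bound by the bus, not by device memory:

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Hypothetical kernel reading directly from mapped (zero-copy) host memory.
    __global__ void scale(const float *in, float *out, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * factor;   // each read/write crosses PCIe
    }

    int main(void)
    {
        // Must be set before any allocation if you want mapped host memory
        // (required on older hardware such as the GTX 460 era).
        cudaSetDeviceFlags(cudaDeviceMapHost);

        const int n = 1 << 20;                 // arbitrary example size
        size_t bytes = n * sizeof(float);

        // Pinned host allocations that are mapped into the device address space.
        float *h_in, *h_out;
        cudaHostAlloc((void **)&h_in,  bytes, cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocMapped);

        for (int i = 0; i < n; ++i)
            h_in[i] = (float)i;

        // Device pointers that alias the host allocations.
        float *d_in, *d_out;
        cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
        cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

        // Throughput here is limited by PCIe (~8 GB/s on x16 gen2),
        // not by the ~100 GB/s of on-board device memory.
        scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
        cudaDeviceSynchronize();

        printf("out[42] = %f\n", h_out[42]);

        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        return 0;
    }

In practice, if the working set exceeds device memory, it is usually faster to split the input into chunks, copy each chunk to the device (overlapping copies and kernels with streams), and process it there, rather than have the kernel read host memory directly.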