Thanks to the answers here yesterday, I think I now have a correct basic test of unified memory using Pascal 1080Ti. It allocates a 50GB single dimension array and adds it u
In this blogpost from Nov 2013: https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/ NVIDIA writes
An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory. Understandably so: the CUDA runtime never has as much information as the programmer does about where data is needed and when! CUDA programmers still have access to explicit device memory allocation and asynchronous memory copies to optimize data management and CPU-GPU concurrency. Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.
Also in March 2014: https://devblogs.nvidia.com/parallelforall/cudacasts-episode-18-cuda-6-0-unified-memory/
CUDA 6 introduces Unified Memory, which dramatically simplifies memory management for GPU computing. Now you can focus on writing parallel kernels when porting code to the GPU, and memory management becomes an optimization.
Now, in CUDA 8 there were some improvements to Unified Memory mechanism https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/. In particular, they say:
An important point is that CUDA programmers still have the tools they need to explicitly optimize data management and CPU-GPU concurrency where necessary: CUDA 8 introduces useful APIs for providing the runtime with memory usage hints (cudaMemAdvise()) and for explicit prefetching (cudaMemPrefetchAsync()). These tools allow the same capabilities as explicit memory copy and pinning APIs without reverting to the limitations of explicit GPU memory allocation.
So it appears that your example may be sped up using cudaMemAdvise()
/ cudaMemPrefetch()
. However even with this, explicit memory management may still have a performance edge.
Added by OP :
Performance through data locality By migrating data on demand between the CPU and GPU, Unified Memory can offer the performance of local data on the GPU, while providing the ease of use of globally shared data. The complexity of this functionality is kept under the covers of the CUDA driver and runtime, ensuring that application code is simpler to write. The point of migration is to achieve full bandwidth from each processor; the 750 GB/s of HBM2 memory bandwidth is vital to feeding the compute throughput of a GP100 GPU. With page faulting on GP100, locality can be ensured even for programs with sparse data access, where the pages accessed by the CPU or GPU cannot be known ahead of time, and where the CPU and GPU access parts of the same array allocations simultaneously.
and
Pascal also improves support for Unified Memory thanks to a larger virtual address space and a new page migration engine, enabling higher performance, oversubscription of GPU memory, and system-wide atomic memory operations.