Thanks to the answers here yesterday, I think I now have a correct basic test of unified memory using Pascal 1080Ti. It allocates a 50GB single dimension array and adds it u
The page faulting process is clearly more complicated than a pure copy of data. As a result, when you drive data to the GPU by page-faulting, it cannot compete performance-wise with a pure copy of the data.
Page faulting essentially introduces another kind of latency for the GPU to deal with. The GPU is a latency-hiding machine, but it needs the programmer to give it the opportunity to hide latency. This can be roughly described as exposing enough parallel work.
On the surface of it, you seem to have exposed a lot of parallel work (~12 billion elements in your dataset). But the work intensity per byte or element retrieved is quite small, so the GPU still has limited opportunity to hide the latency associated with page-faulting here. Stated another way, the GPU has an instantaneous capacity to perform latency hiding based on the maximum complement of threads that can be in flight on that GPU (upper bound: 2048 * # of SMs), and the work exposed in each thread. Unfortunately, the work exposed in each thread in your example could be trivially small: a single addition, basically.
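To put a number on that upper bound, here is a minimal sketch that queries the device and multiplies the two factors. The helper name `maxResidentThreads` and the 4-byte element size are my assumptions, not something taken from your code. On a GTX 1080 Ti (28 SMs, 2048 threads per SM) the bound works out to 57,344 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on threads resident on the GPU at one time.
long long maxResidentThreads(int threadsPerSM, int smCount)
{
    return static_cast<long long>(threadsPerSM) * smCount;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long resident = maxResidentThreads(prop.maxThreadsPerMultiProcessor,
                                            prop.multiProcessorCount);

    // Assumption: 50 GB of 4-byte elements, i.e. roughly 12.5 billion values.
    const double elements = 50e9 / 4.0;

    printf("%s: %d SMs x %d threads/SM = %lld resident threads\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, resident);
    printf("elements per resident thread: %.0f\n", elements / resident);
    return 0;
}
```

The point of the second number is that only a tiny slice of the dataset can be "in flight" at any instant, and each of those threads has almost nothing to do while its page fault is being serviced.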
One of the ways to help with GPU latency hiding is increasing the work per thread, and there are various techniques to do this. A good starting point would be to choose an algorithm (if possible) that has a high compute complexity. Matrix-matrix multiply is the classical example of large compute complexity per element of data.
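One common technique for giving each thread more to do is a grid-stride loop, sketched below under some assumptions: `int` elements with non-negative values, a block size of 256, and a hypothetical wrapper `sumOnDevice` that fills a small managed array with ones so the result is easy to check. Note that this raises the number of elements handled per thread but does not change the arithmetic per byte, so for a plain sum it will not make page-faulting cheap.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Each thread walks the array with a stride equal to the total thread count,
// accumulates privately, and issues a single atomic at the end.
__global__ void sumGridStride(const int *data, size_t n, unsigned long long *result)
{
    unsigned long long local = 0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += data[i];  // assumes non-negative values
    atomicAdd(result, local);
}

// Hypothetical test wrapper: sums n ones held in managed memory.
unsigned long long sumOnDevice(size_t n)
{
    int *data = nullptr;
    unsigned long long *result = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));
    cudaMallocManaged(&result, sizeof(*result));
    for (size_t i = 0; i < n; ++i) data[i] = 1;
    *result = 0;

    // Launch just enough blocks to fill the GPU once; the loop covers the rest.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int threads = 256;
    int blocks = prop.multiProcessorCount * (prop.maxThreadsPerMultiProcessor / threads);

    sumGridStride<<<blocks, threads>>>(data, n, result);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    unsigned long long total = *result;
    cudaFree(data);
    cudaFree(result);
    return total;
}

int main()
{
    const size_t n = 1 << 20;
    unsigned long long total = sumOnDevice(n);
    assert(total == n);
    printf("sum = %llu\n", total);
    return 0;
}
```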
My suggestion in this case is to recognize that what you are trying to do is quite orderly, and therefore not that difficult to manage from a programming point of view: break the work into pieces and manage the data transfer yourself. This will allow you to achieve approximately full utilization of the host->device link bandwidth and (to a very small extent for this example) overlap of copy and compute. For a problem as straightforward and easily decomposable as this one, it makes sense for the programmer not to use UM/oversubscription/page-faulting.
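A rough sketch of what such a partitioned scheme could look like follows. Everything named here (`chunkedSum`, `sumChunk`, the `CHECK` macro, two buffers, `int` elements, the chunk size) is an assumption chosen for illustration, not your code. Two device buffers are each tied to their own stream, so the copy of one chunk can overlap the kernel working on the previous one.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) {      \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    exit(1); } } while (0)

// Grid-stride partial sum of one chunk, accumulated into a shared total.
__global__ void sumChunk(const int *data, size_t n, unsigned long long *result)
{
    unsigned long long local = 0;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += data[i];  // assumes non-negative values
    atomicAdd(result, local);
}

// Hypothetical driver: stream the host array through two device buffers.
unsigned long long chunkedSum(const int *hostData, size_t n, size_t chunkElems)
{
    const int kBuffers = 2;
    cudaStream_t streams[kBuffers];
    int *dBuf[kBuffers];
    unsigned long long *dResult = nullptr;

    CHECK(cudaMalloc(&dResult, sizeof(*dResult)));
    CHECK(cudaMemset(dResult, 0, sizeof(*dResult)));
    CHECK(cudaDeviceSynchronize());  // make sure the total is zeroed first
    for (int b = 0; b < kBuffers; ++b) {
        CHECK(cudaStreamCreate(&streams[b]));
        CHECK(cudaMalloc(&dBuf[b], chunkElems * sizeof(int)));
    }

    int chunk = 0;
    for (size_t off = 0; off < n; off += chunkElems, ++chunk) {
        int b = chunk % kBuffers;  // a buffer is only ever reused in its own stream
        size_t count = std::min(chunkElems, n - off);
        CHECK(cudaMemcpyAsync(dBuf[b], hostData + off, count * sizeof(int),
                              cudaMemcpyHostToDevice, streams[b]));
        sumChunk<<<256, 256, 0, streams[b]>>>(dBuf[b], count, dResult);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    unsigned long long total = 0;
    CHECK(cudaMemcpy(&total, dResult, sizeof(total), cudaMemcpyDeviceToHost));
    for (int b = 0; b < kBuffers; ++b) {
        CHECK(cudaFree(dBuf[b]));
        CHECK(cudaStreamDestroy(streams[b]));
    }
    CHECK(cudaFree(dResult));
    return total;
}

int main()
{
    const size_t n = 1 << 22;  // small stand-in for the real 50GB array
    int *host = nullptr;
    CHECK(cudaMallocHost(&host, n * sizeof(int)));  // pinned, so copies can overlap
    for (size_t i = 0; i < n; ++i) host[i] = 1;

    unsigned long long total = chunkedSum(host, n, 1 << 20);
    assert(total == n);
    printf("sum = %llu\n", total);
    CHECK(cudaFreeHost(host));
    return 0;
}
```

The copies only run asynchronously when the host memory is pinned. Pinning all 50GB may not be practical on your machine, in which case a pair of pinned staging buffers filled from pageable memory would be the usual workaround, at the cost of an extra host-side copy.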
The place where this methodology (UM/oversubscription/page-faulting) may shine, for example, would be an algorithm where it's difficult for the programmer to predict the access pattern ahead of time. Traversal of a large graph (which cannot all be in GPU memory at once) might be an example. If you had a graph traversal problem with a large amount of work for each edge traversal, then the cost of page-faulting as you hop from node to node in the graph might not be a big deal, and the simplification of the programming effort (not having to manage graph data movement explicitly) might be worth the cost.
Regarding prefetching, it's questionable whether it would be of much use here, even if it were available. Prefetching still essentially depends on having something else to do while the prefetch request is in flight. When you have such a low amount of work per data item to be processed, it's not clear that a clever prefetching scheme would really provide much benefit for this example. We can imagine clever, complicated prefetching strategies, but such effort is probably better spent just crafting a partitioned explicit data transfer system for a problem like this.