I'm trying to improve the performance of my PyTorch code by preallocating a page-locked (pinned) tensor and copying all my results from the GPU into it. The results from the GPU will