I am coding an algorithm in CUDA, the most time-consuming part of which is calculating the differences between a target image and a list of comparison images, each comparison