I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back. What I've set up is a GPU class.
What you need is a multi-threaded implementation of the built-in map function. Here is one implementation that, with a little modification to suit your particular needs, should give you what you want:
import threading

def cuda_map(args_list, gpu_instances):
    """Apply each GPU instance's gpufunction to its share of args_list, in parallel."""
    result = [None] * len(args_list)

    def task_wrapper(gpu_instance, task_indices):
        # Each thread works through its assigned indices on its own GPU.
        for i in task_indices:
            result[i] = gpu_instance.gpufunction(args_list[i])

    # Deal the indices out round-robin: thread i takes i, i+n, i+2n, ...
    threads = [
        threading.Thread(
            target=task_wrapper,
            args=(gpu_i, range(len(args_list))[i::len(gpu_instances)]),
        )
        for i, gpu_i in enumerate(gpu_instances)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
It is more or less the same as what you have above, with the big difference being that you don't spend time waiting for each single call to gpufunction to finish before dispatching the next one; the threads let all of your GPUs work through the arguments concurrently.
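For illustration, here is a minimal usage sketch. The GPU class below is a hypothetical stand-in for yours, since I don't know its interface beyond the gpufunction method the code above assumes; substitute your own class that binds a CUDA device.

# Hypothetical stand-in for the asker's GPU class, just to make the
# example runnable. Replace gpufunction with a real kernel launch.
class GPU(object):
    def __init__(self, device_id):
        self.device_id = device_id

    def gpufunction(self, arg):
        # Placeholder for real device work (e.g. a kernel call via your
        # CUDA bindings); here it just doubles the input on the CPU.
        return arg * 2

gpus = [GPU(0), GPU(1)]
print(cuda_map([1, 2, 3, 4, 5], gpus))  # -> [2, 4, 6, 8, 10]

One caveat: this relies on your CUDA bindings releasing the GIL while the device is busy (most do, for kernel launches and synchronization), so the Python threads genuinely overlap rather than serialize.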