Question
I'm calculating the accumulated distance between each pair of kernels inside an nn.Conv2d layer. However, for large layers it runs out of memory on a Titan X with 12 GB of memory. I'd like to know if it is possible to divide such calculations across two GPUs. The code follows:
def ac_distance(layer):
    total = 0
    for p in layer.weight:
        for q in layer.weight:
            total += distance(p, q)
    return total
where layer is an instance of nn.Conv2d and distance returns the sum of the differences between p and q. I can't detach the graph, however, because I need it later on. I tried wrapping my model in nn.DataParallel, but all calculations in ac_distance are done using only one GPU, even though training uses both.
Answer 1:
Parallelism while training neural networks can be achieved in two ways:
- Data Parallelism - split a large batch into two halves and run the same set of operations on each half, each on a different GPU
- Model Parallelism - split the computations themselves and run them on different GPUs
As you have asked in the question, you would like to split the calculation, which falls into the second category. There is no out-of-the-box way to achieve model parallelism. PyTorch provides primitives for parallel processing through the torch.distributed package. The torch.distributed tutorial comprehensively covers the package, and from it you can cook up an approach to achieve the model parallelism you need.
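For the pairwise kernel-distance computation in the question, one possible hand-rolled form of model parallelism is to run half of the outer loop on each GPU and combine the partial sums at the end. The sketch below is only an illustration: it assumes distance is a sum of absolute elementwise differences (the question does not give its exact definition), and the chunking strategy is an assumption, not the only way to do this.

import torch
import torch.nn as nn

def distance(p, q):
    # assumed from the question: sum of the (absolute) differences between p and q
    return (p - q).abs().sum()

def ac_distance_two_gpus(layer):
    # layer.weight has shape (out_channels, in_channels, kH, kW)
    w = layer.weight
    n = w.shape[0]
    half = n // 2

    # .to() is tracked by autograd, so both copies stay connected
    # to the original weights and gradients flow back through them.
    w0 = w.to("cuda:0")
    w1 = w.to("cuda:1")

    total0 = torch.zeros((), device="cuda:0")
    total1 = torch.zeros((), device="cuda:1")

    # First half of the outer loop on GPU 0, second half on GPU 1,
    # so the intermediate graph is spread over both devices.
    for p in w0[:half]:
        for q in w0:
            total0 = total0 + distance(p, q)
    for p in w1[half:]:
        for q in w1:
            total1 = total1 + distance(p, q)

    # Move one partial sum over and combine on a single device.
    return total0 + total1.to("cuda:0")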
However, model parallelism can be very complex to achieve. The usual route is data parallelism with either torch.nn.DataParallel or torch.nn.DistributedDataParallel. In both methods you run the same model on two different GPUs, and each large batch is split into two smaller chunks. With DataParallel the gradients are gathered on a single GPU and the optimization step happens there; with DistributedDataParallel the optimization runs in parallel across GPUs, using one process per GPU.
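As a minimal sketch of the data-parallel wrapper discussed above (the toy model, batch size, and device IDs are placeholders):

import torch
import torch.nn as nn

# Toy model; any nn.Module is wrapped the same way.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

# DataParallel: a single process scatters each batch across device_ids,
# runs the replicas in parallel, and gathers the results on device_ids[0].
dp_model = nn.DataParallel(model, device_ids=[0, 1]).to("cuda:0")

inputs = torch.randn(32, 3, 32, 32, device="cuda:0")
outputs = dp_model(inputs)    # each GPU sees a batch of 16
outputs.sum().backward()      # gradients end up on cuda:0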
In your case, if you use DataParallel, the computation will still take place on two different GPUs. If you notice an imbalance in GPU usage, it could be because of the way DataParallel has been designed. You can try DistributedDataParallel instead, which according to the docs is the fastest way to train on multiple GPUs.
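For reference, a minimal DistributedDataParallel setup looks roughly like the sketch below, with one process per GPU. The address, port, backend choice, and toy model are illustrative assumptions.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each spawned process joins the same process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 2).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Each process works on its own shard of the data.
    out = ddp_model(torch.randn(8, 10, device=rank))
    out.sum().backward()    # gradients are all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)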
There are other ways to handle very large batches too. This article goes through them in detail and I'm sure it would be helpful. A few important points:
- Do gradient accumulation to emulate larger batches (see the sketch after this list)
- Use DataParallel
- If that doesn't suffice, go with DistributedDataParallel
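As referenced in the list, a minimal sketch of gradient accumulation follows; the tiny model, random data, and accumulation factor are placeholders, only the accumulation pattern matters.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accumulation_steps = 4    # effective batch = 4 x the per-step batch size

optimizer.zero_grad()
for step in range(16):
    inputs = torch.randn(8, 10)
    targets = torch.randint(0, 2, (8,))
    # Scale the loss so the accumulated gradient matches one big batch.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()       # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()          # update once per effective (large) batch
        optimizer.zero_grad()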
Source: https://stackoverflow.com/questions/55624102/using-multiple-gpus-outside-of-training-in-pytorch