Parallelization strategies for deep learning

遥遥无期 2021-02-04 05:15

What strategies and forms of parallelization are feasible and available for training and serving a neural network?

  • inside
2 answers
  •  走了就别回头了
    2021-02-04 06:07

    Training

    In general, there are two strategies of parallelizing model training: data parallelism and model parallelism.

    1. Data parallelism

    This strategy splits the training data into N partitions, each of which is processed on a different “device” (a different CPU core, GPU, or even machine). In contrast to training without data parallelism, which produces one gradient per minibatch, we now have N gradients for each minibatch step. The next question is how we should combine these N gradients.

    One way is to average all N gradients and then update the model parameters once based on the average. This technique is called synchronous distributed SGD. Averaging gives a more accurate gradient, but at the cost of waiting for all the devices to finish computing their local gradients.
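
    The averaging step can be illustrated without any framework. The following is only a minimal sketch in plain Python: the toy model y = w * x, the data, the helper names (gradient, split, sync_sgd_step) and the learning rate are all assumptions made for illustration, and the "devices" are simulated one after another in a single process.

```python
# Toy model y = w * x with squared-error loss; everything here is hypothetical.

def gradient(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def split(batch, n):
    """Split a minibatch into n equally sized shards, one per device."""
    size = len(batch) // n
    return [batch[i * size:(i + 1) * size] for i in range(n)]

def sync_sgd_step(w, batch, n_devices, lr):
    """One synchronous step: each device computes a local gradient, the
    gradients are averaged, and the parameter is updated exactly once."""
    local_grads = [gradient(w, shard) for shard in split(batch, n_devices)]
    avg_grad = sum(local_grads) / n_devices
    return w - lr * avg_grad

# Hypothetical minibatch generated from y = 3x.
minibatch = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, minibatch, n_devices=4, lr=0.01)
```

    Because the shards have equal size, the average of the local gradients equals the gradient of the whole minibatch, so the result matches single-device training on the same minibatch.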

    Another way is to not combine the gradients: each gradient is instead used to update the model parameters independently. So there are N parameter updates for each minibatch step, in contrast to only one for the previous technique. This technique is called asynchronous distributed SGD. Because a device doesn't have to wait for the other devices to finish, the async approach takes less time to complete a minibatch step than the sync approach. However, the async approach produces noisier gradients, so it might need more minibatch steps to catch up with the performance (in terms of loss) of the sync approach.
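
    To make the difference concrete, here is a rough simulation of asynchronous updates in plain Python. It is a sketch under several assumptions: the toy model y = w * x, the data, the helper names and the learning rate are hypothetical, and the devices are simulated serially in a fixed order, whereas a real system applies updates in whatever order they happen to arrive.

```python
def gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def async_sgd_step(w, shards, lr):
    """One minibatch step with asynchronous updates (simulated serially).

    Every device reads the shared parameter at the start of the step. By the
    time its gradient is applied, other devices may already have changed the
    parameter, so the gradient is computed from a stale value.
    """
    stale_w = w          # snapshot that every device read before computing
    n_updates = 0
    for shard in shards:
        grad = gradient(stale_w, shard)  # based on the stale snapshot
        w = w - lr * grad                # applied immediately, no averaging
        n_updates += 1
    return w, n_updates

# Hypothetical data generated from y = 3x, split into 4 shards of 2 examples.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i:i + 2] for i in range(0, len(data), 2)]

w = 0.0
for _ in range(200):
    w, n_updates = async_sgd_step(w, shards, lr=0.002)
```

    The step size is deliberately small here: four updates are applied per minibatch step, so the effective step is larger than a single averaged update would be, and with this toy data lr=0.01 would overshoot and diverge.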

    There are many papers proposing some improvements and optimizations on either approach, but the main idea is generally the same as described above.

    In the literature there has been some disagreement about which technique is better in practice. In the end, most people have settled on the synchronous approach.

    Data Parallelism in PyTorch

    To do synchronous SGD, we can wrap our model with torch.nn.parallel.DistributedDataParallel:

    from torch.nn.parallel import DistributedDataParallel as DDP
    
    # `model` is the model we previously initialized
    model = ...
    
    # `rank` is a device number starting from 0
    model = model.to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    

    Then we can train ddp_model in the same way as we would train the original model. For more details, you can refer to the official tutorial.

    For doing asynchronous SGD in PyTorch, we need to implement it more manually since there is no wrapper similar to DistributedDataParallel for it.

    Data Parallelism in TensorFlow/Keras

    For synchronous SGD, we can use tf.distribute.MirroredStrategy to wrap the model initialization:

    import tensorflow as tf
    
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = Model(...)
        model.compile(...)
    

    Then we can train it as usual. For more details, you can refer to the official guides on the Keras website and the TensorFlow website.

    For asynchronous SGD, we can use tf.distribute.experimental.ParameterServerStrategy similarly.

    2. Model Parallelism

    This strategy splits the model into N parts, each of which will be computed on different devices. A common way to split the model is based on layers: different sets of layers are placed on different devices. But we can also split it more intricately depending on the model architecture.
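
    The layer-based split can be sketched in plain Python. This is only an illustration under assumptions: the Stage class, the partition and forward helpers, the device names and the toy "layers" (simple functions) are all hypothetical, and nothing is actually placed on separate hardware.

```python
class Stage:
    """A contiguous group of layers that lives on one (hypothetical) device."""

    def __init__(self, device, layers):
        self.device = device
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def partition(layers, devices):
    """Assign contiguous, roughly equal runs of layers to the devices."""
    n, k = len(layers), len(devices)
    bounds = [i * n // k for i in range(k + 1)]
    return [Stage(dev, layers[bounds[i]:bounds[i + 1]])
            for i, dev in enumerate(devices)]

def forward(stages, x):
    """Run the stages in order. Each hand-off between stages stands in for
    copying the activations from one device to the next."""
    for stage in stages:
        x = stage.forward(x)
    return x

# Four toy "layers" split across two hypothetical devices.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
stages = partition(layers, ["device:0", "device:1"])
result = forward(stages, 5)
```

    Note that with this plain layer-wise split only one stage is busy at a time for a given input; the memory load is shared between devices, but the computation is not sped up by the split alone.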

    Model Parallelism in TensorFlow and PyTorch

    To implement model parallelism in either TensorFlow or PyTorch, the idea is the same: move some of the model parameters onto a different device.

    In PyTorch we can use the torch.nn.Module.to method to move a module to a different device. For example, suppose we want to create two linear layers, each of which is placed on a different GPU:

    import torch.nn as nn
    
    linear1 = nn.Linear(16, 8).to('cuda:0')
    linear2 = nn.Linear(8, 4).to('cuda:1')
    

    In TensorFlow we can use tf.device to place an operation on a specific device. To implement the PyTorch example above in TensorFlow:

    import tensorflow as tf
    from tensorflow.keras import layers
    
    with tf.device('/GPU:0'):
        linear1 = layers.Dense(8, input_dim=16)
    with tf.device('/GPU:1'):
        linear2 = layers.Dense(4, input_dim=8)
    

    For more details you can refer to the official PyTorch tutorial; or, if you use TensorFlow, you can use a higher-level library such as Mesh TensorFlow (mesh).

    3. Hybrid: Data and Model Parallelism

    Recall that data parallelism splits only the training data, whereas model parallelism splits only the model structure. If we have a model so large that it still doesn't fit in memory after using either parallelism strategy, we can always do both.
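
    One way to picture the combination is as a grid of devices: each row is one copy (replica) of the model, split into stages across the devices of that row, and each replica receives its own slice of the minibatch. The sketch below only computes such a layout; the function names, the device labels and the 8-device setup are hypothetical.

```python
def device_grid(devices, n_stages):
    """Group devices into replicas, each made of n_stages model-parallel stages."""
    if len(devices) % n_stages != 0:
        raise ValueError("device count must be a multiple of n_stages")
    return [devices[i:i + n_stages] for i in range(0, len(devices), n_stages)]

def shard_batch(batch, n_replicas):
    """Data parallelism: give each replica an interleaved slice of the minibatch."""
    return [batch[r::n_replicas] for r in range(n_replicas)]

# Hypothetical setup: 8 devices, model split into 2 stages -> 4 replicas.
devices = [f"gpu:{i}" for i in range(8)]
replicas = device_grid(devices, n_stages=2)
shards = shard_batch(list(range(16)), len(replicas))

# Each entry pairs one replica's devices with the examples it will process.
plan = list(zip(replicas, shards))
```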

    In practice most people prefer data parallelism to model parallelism since the former is more decoupled (in fact, independent) from the model architecture than the latter. That is, by using data parallelism they can change the model architecture as they like, without worrying which part of the model should be parallelized.

    Model Inference / Serving

    Parallelizing model serving is easier than parallelizing model training, since the model parameters are already fixed and each request can be processed independently. Similar to scaling a regular Python web service, we can scale model serving by spawning more processes (to work around Python's GIL) on a single machine, or even by spawning more machine instances.

    When we use a GPU to serve the model, though, we need to do more work to scale it. Because a GPU handles concurrency differently from a CPU, we need to batch inference requests in order to maximize performance. The idea is that when a request arrives, instead of processing it immediately, we wait for a timeout duration so that other requests can arrive. When the timeout is up, we batch all the requests received so far (even if there is only one) and process them together on the GPU.
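
    The waiting logic can be sketched with Python's standard queue module. This is a minimal illustration, not how any particular serving system implements it: collect_batch and run_model are hypothetical names, run_model merely stands in for a batched GPU forward pass, and the max_batch_size cap is an extra assumption that the description above does not mention.

```python
import queue
import time

def collect_batch(requests, timeout_s, max_batch_size):
    """Block until a first request arrives, then keep collecting requests
    until the timeout expires or the (assumed) batch size cap is reached."""
    batch = [requests.get()]                 # wait for the first request
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:                  # timeout is up
            break
    return batch

def run_model(batch):
    """Hypothetical stand-in for one batched forward pass on the GPU."""
    return [x * 2 for x in batch]

# Three requests are already waiting when the scheduler wakes up.
pending = queue.Queue()
for x in (1, 2, 3):
    pending.put(x)

batch = collect_batch(pending, timeout_s=0.05, max_batch_size=8)
outputs = run_model(batch)
```

    In a real server, request handlers would put work on the queue from other threads, and each request would also need a way to receive its own result back, which this sketch leaves out.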

    In order to minimize the average request latency, we need to find the optimal timeout duration. To find it, observe that there is a trade-off between minimizing the timeout duration and maximizing the batch size. If the timeout is too short, the batches will be small, so the GPU will be underutilized. But if the timeout is too long, the requests that arrive early will wait too long before they get processed. So the optimal timeout duration depends on the model complexity (hence, the inference duration) and the average number of requests received per second.
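
    A toy simulation can make this trade-off visible. The sketch below rests on several assumptions that are not from any real serving system: a single GPU, a batching window that opens as soon as the GPU is free and a request is waiting, and an inference time of fixed_cost_ms + per_item_ms * batch_size. All names and numbers are hypothetical, with times in milliseconds.

```python
def average_latency(arrivals_ms, timeout_ms, fixed_cost_ms, per_item_ms):
    """Average request latency for a single simulated GPU with batching."""
    latencies = []
    server_free = 0
    i = 0
    while i < len(arrivals_ms):
        # The window opens once the GPU is free and a request is waiting.
        window_open = max(server_free, arrivals_ms[i])
        window_close = window_open + timeout_ms
        # Every request that has arrived by the end of the window joins the batch.
        j = i
        while j < len(arrivals_ms) and arrivals_ms[j] <= window_close:
            j += 1
        batch = arrivals_ms[i:j]
        done = window_close + fixed_cost_ms + per_item_ms * len(batch)
        latencies.extend(done - arrived for arrived in batch)
        server_free = done
        i = j
    return sum(latencies) / len(latencies)

# Four requests arriving 10 ms apart; assumed cost of 100 ms + 10 ms per item.
arrivals = [0, 10, 20, 30]
results = {
    timeout: average_latency(arrivals, timeout, fixed_cost_ms=100, per_item_ms=10)
    for timeout in (0, 30, 200)
}
```

    Under these assumptions, the 30 ms timeout gives a lower average latency than either 0 ms (the first request is served alone, so the others wait for a second pass) or 200 ms (every request waits out a long window).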

    Implementing a scheduler that does request batching is not a trivial task, so instead of doing it manually, it is better to use TensorFlow Serving or PyTorch Serve (TorchServe), both of which already support it.


    To learn more about parallel and distributed learning, you can read this review paper.
