How to handle non-determinism when training on a GPU?

后端 未结 2 1044
逝去的感伤
逝去的感伤 2020-12-03 05:09

While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the co

相关标签:
2条回答
  • 2020-12-03 05:58

    To get a MNIST network (https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py) to train deterministically on my GPU (1050Ti):

    • Set PYTHONHASHSEED='SOMESEED'. I do it before starting the python kernel.
    • Set seeds for random generators (not sure all are needed for MNIST)
        python_random.seed(42)
        np.random.seed(42)
        tf.set_random_seed(42)
    
    • Make TF select deterministic GPU algorithms Either:
        import tensorflow as tf
        from tfdeterminism import patch
        patch()
    

    Or:

        os.environ['TF_CUDNN_DETERMINISTIC']='1'
        import tensorflow as tf
    

    Note that the resulting loss is repeatable with either method for selecting deterministic algorithms from TF, but the two methods result in different losses. Also, the solution above doesn't make a more complicated model I'm using repeatable.

    Check out https://github.com/NVIDIA/framework-determinism for a more current answer.

    A side note:

    For cuda cuDNN 8.0.1, non deterministic algorithms exist for:

    (from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html)

    • cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used
    • cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used
    • cudnnPoolingBackward when CUDNN_POOLING_MAX is used
    • cudnnSpatialTfSamplerBackward
    • cudnnCTCLoss and cudnnCTCLoss_v8 when CUDNN_CTC_LOSS_ALGO_NON_DETERMINSTIC is used
    0 讨论(0)
  • 2020-12-03 06:07

    TL;DR

    • Non-determinism for a priori deterministic operations come from concurrent (multi-threaded) implementations.
    • Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar with the other major toolkits.
    • During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.

    That, but much longer

    When you see neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy – everything here are mathematical equations and should be deterministic. Even pseudo-random operations such as shuffling, drop-out, noise and the likes, are entirely determined by a seed.

    When you see those operations from their computational implementation, on the other hand, you see them as massively parallelized computations, which can be source of randomness unless you are very careful.

    The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will end first. It is not important when threads operate on their own data, so for example, applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when you compute a sum, then the result may depend on the order of the summation, and in turn, on the order of which thread ended first.

    From there, you have broadly speaking two options:

    • Keep non-determinism associated with simpler implementations.

    • Take extra care in the design of your parallel algorithm to reduce or remove non-determinism in your computation. The added constraint usually results in slower algorithms

    Which route takes CuDNN? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations, and more importantly, it used to not offer some operations such as reduction, that people needed to implement themselves in CUDA with a variable degree of consideration to determinism.

    Some libraries such as theano were more ahead of this topic, by exposing early on a deterministic flag that the user could turn on or off – but as you can see from its description, it is far from offering any guarantee.

    If more, sometimes we will select some implementation that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementaion, e.g. when we do not have a GPU implementation that is deterministic. Also see the dnn.conv.algo* flags to cover more cases.

    In TensorFlow, the realization of the need for determinism has been rather late, but it's slow getting there – helped by the advance of CuDNN on that front also. For a long time, reductions have been non-deterministic, but now they seem to be deterministic. The fact that CuDNN introduced deterministic reductions in version 6.0 may have helped of course.

    It seems that currently, the main obstacle for TensorFlow towards determinism is the backward pass of the convolution. It is indeed one of the few operations for which CuDNN proposes a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow. And since the choice of the filter seems to be based on performance, it could indeed be picked if it is more efficient. (I am not so familiar with TensorFlow's C++ code so take this with a grain of salt.)

    Is this important?

    If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.

    Deployment is another aspect of things, where it is often desirable to have a deterministic behavior, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward that a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)

    Those reasons are rightful motivations to fix non-determinism in neural networks.

    For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural net training. For all purposes, training is stochastic. We use stochastic gradient descent, shuffle data, use random initialization and dropout – and more importantly, training data is itself but a random sample of data. From that standpoint, the fact that computers can only generate pseudo-random numbers with a seed is an artifact. When you train, your loss is a value that also comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring those confidence intervals does not make much sense – therefore it is vain, in my opinion, to spend too much effort fixing non-determinism in that, and many other, cases.

    0 讨论(0)
提交回复
热议问题