float16 vs float32 for convolutional neural networks

忘了有多久 2021-02-07 13:26

The standard is float32, but I'm wondering under what conditions it's OK to use float16?

I've compared running the same convnet with both datatypes and haven't noticed any issues.

3 Answers
  • 2021-02-07 14:10

    float16 training is tricky: your model might not converge with plain float16, but float16 does save memory and is also faster on the latest Volta GPUs. NVIDIA recommends "Mixed Precision Training" in its latest documentation and paper.

    To better use float16, you need to manually and carefully choose the loss_scale. If loss_scale is too large, you may get NaNs and Infs; if loss_scale is too small, the model might not converge. Unfortunately, there is no common loss_scale that works for all models, so you have to choose it carefully for your specific model.
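
    A minimal sketch of the loss-scaling idea, assuming TF 1.x-style graph code (the toy model, shapes, and the starting value of 128 are all illustrative, not a recommendation):

    ```python
    import tensorflow as tf  # assumes TF 1.x-style graph APIs

    # Toy float16 model; shapes and names are illustrative only.
    x = tf.placeholder(tf.float16, [None, 32])
    y = tf.placeholder(tf.float16, [None, 1])
    w = tf.Variable(tf.random_normal([32, 1], dtype=tf.float16))
    pred = tf.matmul(x, w)

    # Compute the loss in float32 so the reduction itself does not overflow.
    loss = tf.reduce_mean(tf.square(tf.cast(pred - y, tf.float32)))

    # Manual loss scaling: scale the loss up before computing gradients so
    # small gradients stay representable in float16, then scale the gradients
    # back down before applying them so the effective update is unchanged.
    loss_scale = 128.0  # model-specific; this is the value you have to tune
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
    unscaled = [(g / loss_scale, v) for g, v in grads_and_vars if g is not None]
    train_op = optimizer.apply_gradients(unscaled)
    ```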

    If you just want to reduce the memory usage, you could also try tf.to_bfloat16, which might converge better.
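
    For illustration, a tiny sketch of that cast (assumes TF 1.x, where tf.to_bfloat16 exists; in TF 2.x you would use tf.cast(x, tf.bfloat16) instead):

    ```python
    import tensorflow as tf  # TF 1.x assumed for tf.to_bfloat16

    x = tf.placeholder(tf.float32, [None, 32])

    # bfloat16 keeps float32's 8-bit exponent and truncates the mantissa,
    # so it halves memory like float16 but is much less prone to overflow/underflow.
    x_bf16 = tf.to_bfloat16(x)

    # Cast back up before numerically sensitive ops such as loss reductions.
    x_back = tf.cast(x_bf16, tf.float32)
    ```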

  • 2021-02-07 14:16

    According to this study:

    Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015, June). Deep learning with limited numerical precision. In International Conference on Machine Learning (pp. 1737-1746). At: https://arxiv.org/pdf/1502.02551.pdf

    stochastic rounding was required to obtain convergence when training with 16-bit (half-precision) fixed-point arithmetic; however, when that rounding technique was used, they reported very good results.

    Here's a relevant quotation from that paper:

    "A recent work (Chen et al., 2014) presents a hardware accelerator for deep neural network training that employs fixed-point computation units, but finds it necessary to use 32-bit fixed-point representation to achieve convergence while training a convolutional neural network on the MNIST dataset. In contrast, our results show that it is possible to train these networks using only 16-bit fixed-point numbers, so long as stochastic rounding is used during fixed-point computations."

    For reference, here's the citation for Chen et al., 2014:

    Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., ... & Temam, O. (2014, December). Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 609-622). IEEE Computer Society. At: http://ieeexplore.ieee.org/document/7011421/?part=1
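
    As an illustration of the idea, here is a small NumPy sketch of stochastic rounding onto a fixed-point grid (a simplified toy, not the paper's exact scheme; the function name and bit width are my own choices):

    ```python
    import numpy as np

    def stochastic_round_fixed(x, frac_bits=8, rng=None):
        """Round x onto a fixed-point grid with `frac_bits` fractional bits,
        rounding up with probability equal to the fractional remainder so the
        rounding is unbiased in expectation."""
        rng = rng or np.random.default_rng()
        scale = 2.0 ** frac_bits
        scaled = np.asarray(x, dtype=np.float64) * scale
        floor = np.floor(scaled)
        round_up = rng.random(scaled.shape) < (scaled - floor)
        return (floor + round_up) / scale

    # The expected value matches the input even though the grid step is 1/256,
    # which is why small accumulated updates are not lost the way they are
    # with round-to-nearest.
    vals = stochastic_round_fixed(np.full(100_000, 0.30004))
    print(vals.mean())  # ~0.30004
    ```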

  • 2021-02-07 14:28

    Surprisingly, it's totally OK to train with 16 bits, not just for fun but in production as well. For example, in this video Jeff Dean talks about 16-bit calculations at Google, around 52:00. A quote from the slides:

    "Neural net training very tolerant of reduced precision"

    Since GPU memory is the main bottleneck in ML computation, there has been a lot of research on precision reduction. For example:

    • The Gupta et al. paper "Deep Learning with Limited Numerical Precision", about 16-bit fixed-point (not floating-point) training with stochastic rounding.

    • Courbariaux et al., "Training Deep Neural Networks with Low Precision Multiplications", about 10-bit activations and 12-bit parameter updates.

    • And this is not the limit. Courbariaux et al., "BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1", discusses 1-bit activations and weights (though with higher precision for the gradients), which makes the forward pass extremely fast; see the sketch after this list.
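
    To make that last point concrete, here is a tiny NumPy sketch of the deterministic binarization used in such a forward pass (a simplified illustration, not the paper's full training procedure, which also keeps real-valued weight copies for the updates):

    ```python
    import numpy as np

    def binarize(w):
        """Map real-valued weights/activations to +1 or -1 (sign binarization).
        During training, full-precision weights are kept and updated; only the
        forward/backward computations use the binarized values."""
        return np.where(w >= 0, 1.0, -1.0)

    w_real = 0.1 * np.random.randn(3, 3)
    print(binarize(w_real))  # every entry is +1.0 or -1.0
    ```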

    Of course, I can imagine that some networks may require high precision for training, but I would recommend at least trying 16 bits for training a big network and switching back to 32 bits if it proves to work worse.
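
    If you want to run that comparison yourself, a sketch using tf.keras (TF 2.x-style) might look like this; the layer sizes and input shape are illustrative, and keeping the final softmax in float32 is a common stability precaution:

    ```python
    import tensorflow as tf

    def build_convnet(dtype=tf.float16):
        """The same small convnet in either precision."""
        inputs = tf.keras.Input(shape=(32, 32, 3), dtype=dtype)
        x = tf.keras.layers.Conv2D(32, 3, activation="relu", dtype=dtype)(inputs)
        x = tf.keras.layers.MaxPooling2D(dtype=dtype)(x)
        x = tf.keras.layers.Conv2D(64, 3, activation="relu", dtype=dtype)(x)
        x = tf.keras.layers.Flatten(dtype=dtype)(x)
        # Keep the final layer in float32 for a numerically stable softmax.
        outputs = tf.keras.layers.Dense(10, activation="softmax", dtype=tf.float32)(x)
        return tf.keras.Model(inputs, outputs)

    model16 = build_convnet(tf.float16)
    model32 = build_convnet(tf.float32)
    # Train both on the same data and compare accuracy, speed, and memory use.
    ```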
