Paper：Xavier参数初始化之《Understanding the difficulty of training deep feedforward neural networks》的翻译与解读

Understanding the difficulty of training deep feedforward neural networks

原论文地址：http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc_location=ufi
作者：Xavier Glorot Yoshua Bengio DIRO, Universite de Montr ´ eal, Montr ´ eal, Qu ´ ebec, Canada

Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

在2006年之前，深层多层神经网络似乎并没有被成功地训练，从那时起，一些算法已经被证明能够成功地训练它们，实验结果显示了深层结构与非深层结构的优越性。所有这些实验结果都是使用新的初始化或训练机制得到的。我们的目标是更好地理解为什么深度神经网络在随机初始化的标准梯度下降中表现如此糟糕，更好地理解这些最近的相对成功，并在未来帮助设计更好的算法。我们首先观察非线性激活函数的影响。我们发现逻辑s型激活不适合具有随机初始化的深度网络，因为它的均值会使深度网络特别是最顶层的隐藏层达到饱和。令人惊讶的是，我们发现饱和的单位可以自己走出饱和状态，尽管速度很慢，这也解释了为什么在训练神经网络时有时会出现停滞状态。我们发现，一个新的非线性，饱和少往往是有益的。最后，我们研究了激活度和梯度在层间和训练过程中的变化，认为当与每一层相关的雅可比矩阵的奇异值远离1时，训练可能会更加困难。基于这些考虑，我们提出了一种新的初始化方案，该方案大大加快了收敛速度。

5 Error Curves and Conclusions 误差曲线及结论

The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3 × 2, while Table 1 gives final test error for all the datasets studied (Shapeset-3 × 2, MNIST, CIFAR-10, and SmallImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.	我们关心的最后一个问题是不同策略下的训练是否成功，这可以用误差曲线来最好地说明，误差曲线显示了测试误差随训练的进展和渐近线的演变。图11显示了对Shapeset-3×2进行在线训练后的曲线，表1给出了所有研究数据集的最终测试误差(Shapeset-3×2、MNIST、ci- 10和SmallImageNet)。作为基线，我们对10万个Shapeset样本的RBF SVM模型进行了优化，得到了59.47%的测试误差，而在同一组样本中，我们得到了50.47%的深度5双曲正切网络，并进行了归一化初始化。
	Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not! 图8:不同层的权重梯度归一化直方图，初始化后使用双曲正切激活，标准初始化(顶部)和归一化初始化(底部)。即使使用标准的初始化，反向传播的梯度也会变小，但是权值梯度不会变小! Figure 9: Standard deviation intervals of the weights gradients with hyperbolic tangents with standard initialization (top) and normalized (bottom) during training. We see that the normalization allows to keep the same variance of the weights gradient across layers, during training (top: smaller variance for higher layers). 图9:训练过程中，带有标准初始化的双曲切线权值梯度(上)和归一化(下)的标准差区间。我们可以看到，在训练过程中，规范化允许在不同层之间保持相同的权值梯度的方差(顶部:更高层的方差更小)。 Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005. 表1:不同激活函数和初始化方案对5个隐含层的深度网络的测试误差。激活函数名后的N表示使用规范化初始化。在p = 0.005的原假设检验下，粗体的结果与非粗体的结果有统计学差异。 Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning. 图10:学习过程中正切激活值分布的98个百分位(单独标记)和标准差(带有标记的实线)。
These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning, this might explain that the normalized initialization or the softsign effects are more visible	这些结果说明了激活和初始化选择的影响。作为参考，我们在图11中包括了经过去噪自动编码器的无监督预训练后获得的初始化的监督微调的误差曲线(Vincent et al.， 2008)。对于每个网络，我们分别选择学习速率来最小化验证集上的错误。我们可以注意到，在Shapeset-3×2上，由于任务的难度，我们观察到了学习过程中的重要饱和，这可能解释了规范化初始化或软标记效应更明显
Several conclusions can be drawn from these error curves: • The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima. • The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity. • For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward). Others methods can alleviate discrepancies between layers during learning, e.g., exploiting second order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both those methods have been applied for Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance but not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.	从这些误差曲线可以得出以下几点结论: •具有s形或双曲正切单元和标准初始化的更经典的神经网络表现相当差，收敛速度更慢，而且显然最终会导致更差的局部极小值。 •软信号网络在初始化过程中似乎比tanh网络更健壮，可能是因为它们的非线性更温和。 •对于tanh网络，建议的规范化初始化可能非常有帮助，可能是因为层到层的转换保持激活量(向上流动)和梯度(向后流动)。其他方法可以缓解学习过程中各层之间的差异，如利用二阶信息分别设置各参数的学习速率。例如，我们可以利用Hessian (LeCun et al.， 1998b)或梯度方差估计的对角线。这两种方法都应用于双曲正切和标准初始化的3×2形状。我们观察到性能上的提高，但没有达到规范化初始化得到的结果。此外，通过将规范化初始化与二阶方法相结合，我们还观察到了进一步的收获:估计的Hessian可能会关注单元之间的差异，而不必纠正层之间重要的初始差异。
	Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization. 图11:在Shapeset-3×2数据集上进行在线训练时，各种激活函数和初始化方案的测试误差(为了减少最终误差，从上到下排序)。激活函数名后的N表示使用规范化初始化。 Figure 12: Test error curves during training on MNIST and CIFAR10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization. 图12:MNIST和CIFAR10上的各种激活函数和初始化方案的训练误差曲线(从上到下排序以减少最终误差)。激活函数名后的N表示使用规范化初始化。
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number. The other conclusions from this study are the following: • Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets. • Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer. • Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and allows to eliminate a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning. • Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.	在所有报告的实验中，我们使用了相同数量的单位每层。但是，我们验证了随着层数的增加(或减少)，我们获得了相同的收益。本研究的其他结论如下: •跨层监控激活和梯度以及训练迭代是理解深层网络中训练困难的一个强大的调查工具。 •从较小的随机权值初始化时应避免Sigmoid激活(不是围绕0对称的)，因为它们会产生较差的学习动态，且初始隐藏层已饱和。 •保持分层之间的转换，使激活和梯度都能很好地流动(即雅可比矩阵在1附近)，这看起来很有帮助，并允许消除纯监督深度网络和非监督学习的预训练网络之间的很大一部分差异。 •我们的许多观察结果仍然无法解释，这意味着需要进一步的研究来更好地理解深层架构中的梯度和训练动态。