Paper: Translation and Interpretation of "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (the He initialization paper)
Table of Contents
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Comparisons between ReLU and PReLU
Comparisons of Single-model Results
Comparisons of Multi-model Results
Analysis of Results
Comparisons with Human Performance from [22]
Related Articles
Paper: Translation and Interpretation of "Understanding the difficulty of training deep feedforward neural networks" (the Xavier initialization paper)
Paper: Translation and Interpretation of "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet C" (the He initialization paper)
DL (DNN Optimization Techniques): Introduction and Detailed Usage Guide for Parameter Initialization in DNNs (He Initialization and Xavier Initialization)
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Original paper: https://arxiv.org/pdf/1502.01852.pdf
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research) {kahe, v-xiangz, v-shren, jiansun}@microsoft.com
Abstract
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
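To make the two contributions concrete, here is a minimal NumPy sketch (not the authors' code) of the PReLU activation, f(y_i) = y_i for y_i > 0 and f(y_i) = a_i y_i otherwise with a learnable slope a_i, together with the initialization the paper derives: zero-mean Gaussian weights with standard deviation sqrt(2 / n_l), where n_l is the fan-in of layer l. The function names, shapes, and the forward-only view are illustrative assumptions.

```python
import numpy as np

def prelu(y, a):
    """PReLU: identity for positive inputs, learned slope `a` for negative inputs.
    `a` may be a scalar (channel-shared) or one value per channel (channel-wise)."""
    return np.where(y > 0, y, a * y)

def he_normal_init(fan_in, fan_out, rng=None):
    """He initialization for a rectifier layer: zero-mean Gaussian with
    std = sqrt(2 / fan_in), which keeps activation variances stable across layers."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example: one fully connected layer followed by a channel-wise PReLU.
x = np.random.randn(4, 256)        # hypothetical batch of 4 inputs with 256 features
W = he_normal_init(256, 128)       # weights drawn with std = sqrt(2 / 256)
a = np.full(128, 0.25)             # one slope per output channel, initialized to 0.25 as in the paper
out = prelu(x @ W, a)
print(out.shape)                   # (4, 128)
```

For a convolutional layer the same rule applies with fan-in n_l = k^2 * d, i.e., the kernel size squared times the number of input channels.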
4. Experiments on ImageNet
We perform the experiments on the 1000-class ImageNet 2012 dataset [22], which contains about 1.2 million training images, 50,000 validation images, and 100,000 test images (with no published labels). The results are measured by top-1/top-5 error rates [22]. We only use the provided data for training. All results are evaluated on the validation set, except for the final results in Table 7, which are evaluated on the test set. The top-5 error rate is the metric officially used to rank the methods in the classification challenge [22].
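As a reminder of how these metrics are computed: an image counts as correct under top-5 if its ground-truth label is among the five highest-scoring classes, and the error rate is the fraction of images that fail this test. The sketch below uses hypothetical array names and small random placeholders purely for shape checking.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of images whose true label is NOT among the k highest-scoring classes."""
    # Indices of the k largest scores per image (their internal order does not matter).
    topk = np.argpartition(-scores, k, axis=1)[:, :k]
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# scores: (num_images, 1000) class scores; labels: ground-truth class indices.
# A small random placeholder stands in for the 50,000 validation images.
scores = np.random.randn(10000, 1000)
labels = np.random.randint(0, 1000, size=10000)
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=5))
```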
Comparisons between ReLU and PReLU
In Table 4, we compare ReLU and PReLU on the large model A. We use the channel-wise version of PReLU. For fair comparisons, both ReLU/PReLU models are trained using the same total number of epochs, and the learning rates are also switched after running the same number of epochs. Table 4 shows the results at three scales and the multi-scale combination. The best single scale is 384, possibly because it is in the middle of the jittering range [256, 512]. For the multi-scale combination, PReLU reduces the top-1 error by 1.05% and the top-5 error by 0.23% compared with ReLU. The results in Table 2 and Table 4 consistently show that PReLU improves both small and large models. This improvement is obtained with almost no computational cost.
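The multi-scale combination above can be read as averaging the class scores a model produces at several test scales before measuring the error. A minimal sketch of that idea follows; the particular scale values and array names are assumptions for illustration, not taken from Table 4.

```python
import numpy as np

num_images = 5000   # small placeholder for the validation set
# Hypothetical class scores for the same images, each obtained by resizing the
# shorter image side to a different test scale before evaluation.
scores_by_scale = {
    256: np.random.rand(num_images, 1000),
    384: np.random.rand(num_images, 1000),   # 384 is the best single scale reported above
    480: np.random.rand(num_images, 1000),
}

# Multi-scale combination: average the per-scale scores, then rank classes
# (and compute top-1/top-5 error) on the averaged scores.
combined = np.mean(list(scores_by_scale.values()), axis=0)
predictions = np.argsort(-combined, axis=1)[:, :5]   # top-5 classes per image
```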
Comparisons of Single-model Results
Next we compare single-model results. We first show 10-view testing results [16] in Table 5. Here, each view is a 224×224 crop. The 10-view results of VGG-16 are based on our testing using the publicly released model [25], as they are not reported in [25]. Our best 10-view result is 7.38% (Table 5). Our other models also outperform the existing results.

Table 6 shows the comparisons of single-model results, which are all obtained using multi-scale and multi-view (or dense) testing. Our results are denoted as MSRA. Our baseline model (A+ReLU, 6.51%) is already substantially better than the best existing single-model result of 7.1% reported for VGG-19 in the latest update of [25] (arXiv v5). We believe that this gain is mainly due to our end-to-end training, without the need of pre-training shallow models.
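For reference, 10-view testing in the sense of [16] averages the network's predictions over ten fixed views of each image: the four corner crops, the center crop, and their horizontal flips. Below is a sketch of generating the ten 224×224 views; the input size is an assumed example, and the model call is only indicated in a comment.

```python
import numpy as np

def ten_view_crops(img, crop=224):
    """Return the 10 standard test views: 4 corner crops, the center crop,
    and the horizontal flip of each."""
    h, w, _ = img.shape
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    views = []
    for y, x in offsets:
        patch = img[y:y + crop, x:x + crop]
        views.append(patch)
        views.append(patch[:, ::-1])      # horizontal flip
    return np.stack(views)                # shape (10, crop, crop, 3)

img = np.random.rand(256, 256, 3)         # a validation image resized so its shorter side is 256 (assumed)
views = ten_view_crops(img)
# scores = model(views); the final prediction averages the 10 per-view score vectors.
```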
Moreover, our best single model (C, PReLU) has 5.71% top-5 error. This result is even better than all previous multi-model results (Table 7). Comparing A+PReLU with B+PReLU, we see that the 19-layer model and the 22-layer model perform comparably. On the other hand, increasing the width (C vs. B, Table 6) can still improve accuracy. This indicates that when the models are deep enough, the width becomes an essential factor for accuracy.
Comparisons of Multi-model Results
We combine six models, including those in Table 6. For the time being we have trained only one model with architecture C. The other models have accuracy inferior to C by considerable margins. We conjecture that we can obtain better results by using fewer, stronger models.

The multi-model results are in Table 7. Our result is 4.94% top-5 error on the test set. This number is evaluated by the ILSVRC server, because the labels of the test set are not published. Our result is 1.7% better than the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]), which represents a ∼26% relative improvement. This is also a ∼17% relative improvement over the latest result (Baidu, 5.98% [32]).
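The paper does not spell out exactly how the six models are combined; a common scheme, shown here as an assumption, is to average the per-model class scores and rank classes by the average. The score arrays below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical softmax scores from six independently trained models; a small
# placeholder of 1,000 images stands in for the 100,000 test images.
model_scores = [np.random.rand(1000, 1000) for _ in range(6)]

# Simple ensemble: average the class scores across models, then take the
# five highest-ranked classes per image as the top-5 prediction.
ensemble = np.mean(model_scores, axis=0)
top5 = np.argsort(-ensemble, axis=1)[:, :5]   # top-5 predicted classes per image
```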
Analysis of Results
Figure 4 shows some example validation images successfully classified by our method. Besides the correctly predicted labels, we also pay attention to the other four predictions in the top-5 results. Some of these four labels are other objects in the multi-object images, e.g., the “horse-cart” image (Figure 4, row 1, col 1) contains a “mini-bus” and it is also recognized by the algorithm. Some of these four labels are due to the uncertainty among similar classes, e.g., the “coucal” image (Figure 4, row 2, col 1) has predicted labels of other bird species.

Figure 6 shows the per-class top-5 error of our result (average of 4.94%) on the test set, displayed in ascending order. Our result has zero top-5 error in 113 classes: the images in these classes are all correctly classified. The three classes with the highest top-5 error are “letter opener” (49%), “spotlight” (38%), and “restaurant” (36%). The errors are due to the existence of multiple objects, small objects, or large intra-class variance. Figure 5 shows some example images misclassified by our method in these three classes. Some of the predicted labels still make some sense.

In Figure 7, we show the per-class difference of top-5 error rates between our result (average of 4.94%) and our team’s in-competition result in ILSVRC 2014 (average of 8.06%). The error rates are reduced in 824 classes, unchanged in 127 classes, and increased in 49 classes.
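The per-class statistics behind Figures 6 and 7 can be recomputed from top-5 predictions as follows; the array names are hypothetical, and the final counts mirror the reduced/unchanged/increased breakdown described above.

```python
import numpy as np

def per_class_top5_error(top5_preds, labels, num_classes=1000):
    """Per-class top-5 error: for each class, the fraction of its images whose
    true label is not among the five predicted classes."""
    miss = ~(top5_preds == labels[:, None]).any(axis=1)
    return np.array([miss[labels == c].mean() for c in range(num_classes)])

# Hypothetical predictions from two submissions on the same labeled set:
# err_new = per_class_top5_error(top5_new, labels)
# err_old = per_class_top5_error(top5_old, labels)
# print((err_new == 0).sum())                     # classes with zero top-5 error
# print((err_new < err_old).sum(),                # classes with reduced error
#       (err_new == err_old).sum(),               # unchanged
#       (err_new > err_old).sum())                # increased
```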
Comparisons with Human Performance from [22]
Russakovsky et al. [22] recently reported that human performance yields a 5.1% top-5 error on the ImageNet dataset. This number is achieved by a human annotator who is well trained on the validation images to be better aware of the existence of relevant classes. When annotating the test images, the human annotator is given a special interface, where each class title is accompanied by a row of 13 example training images. The reported human performance is estimated on a random subset of 1500 test images.

Our result (4.94%) exceeds the reported human-level performance. To our knowledge, our result is the first published instance of surpassing humans on this visual recognition challenge. The analysis in [22] reveals that the two major types of human errors come from fine-grained recognition and class unawareness. The investigation in [22] suggests that algorithms can do a better job on fine-grained recognition (e.g., 120 species of dogs in the dataset). The second row of Figure 4 shows some example fine-grained objects successfully recognized by our method: “coucal”, “komondor”, and “yellow lady’s slipper”. While humans can easily recognize these objects as a bird, a dog, and a flower, it is nontrivial for most humans to tell their species.

On the negative side, our algorithm still makes mistakes in cases that are not difficult for humans, especially for those requiring context understanding or high-level knowledge (e.g., the “spotlight” images in Figure 5). While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general. On recognizing elementary object categories (i.e., common objects or concepts in daily life), such as in the Pascal VOC task [6], machines still have obvious errors in cases that are trivial for humans. Nevertheless, we believe that our results show the tremendous potential of machine algorithms to match human-level performance on visual recognition.
Source: oschina
Link: https://my.oschina.net/u/4305019/blog/3229223