R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling.
- R-CNN -> SPP-Net
- R-CNN is slow because it performs a ConvNet forward pass for each region proposal without sharing computation.
- SPP-Net computes a convolutional feature map for the entire input image, then classifies each region proposal using a feature vector extracted from the shared feature map (see the pooling sketch below).
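To make the pooling step concrete, here is a minimal sketch assuming PyTorch (the paper does not prescribe a framework, and the proposal is assumed to already be projected into feature-map coordinates): the region of the shared feature map inside one proposal is max-pooled into several fixed sizes, which are concatenated as in spatial pyramid pooling.

```python
import torch
import torch.nn.functional as F

def spp_pool(feature_map, proposal, output_sizes=(6, 3, 2, 1)):
    """Pool the feature-map region inside one proposal into fixed-size outputs.

    feature_map: (C, H, W) shared conv feature map of the whole image.
    proposal:    (x1, y1, x2, y2) box, assumed already in feature-map coordinates.
    """
    x1, y1, x2, y2 = proposal
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1].unsqueeze(0)  # (1, C, h, w)
    pooled = []
    for s in output_sizes:
        # Adaptive max pooling yields an s x s grid regardless of the region's size.
        pooled.append(F.adaptive_max_pool2d(region, s).flatten(1))
    # Concatenate the multi-level outputs into one fixed-length vector.
    return torch.cat(pooled, dim=1)  # (1, C * sum(s*s for s in output_sizes))

# Example: a 512-channel feature map and one proposal box.
fmap = torch.randn(512, 40, 60)
vec = spp_pool(fmap, (5, 5, 30, 25))
print(vec.shape)  # torch.Size([1, 25600]) = 512 * (36 + 9 + 4 + 1)
```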
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed for SPPnet cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this restriction (fixed convolutional layers) limits the accuracy of very deep networks.
- SPP-Net -> Fast R-CNN
- SPP-Net training is a multi-stage pipeline, and features must be written to disk.
- SPP-Net cannot update the convolutional layers that precede the spatial pyramid pooling layer.
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:
- Higher detection quality (mAP) than R-CNN, SPPnet
- Training is single-stage, using a multi-task loss
- Training can update all network layers
- No disk storage is required for feature caching
Fast R-CNN summary
- Fast R-CNN training is single-stage and uses a multi-task loss.
- All network layers can be updated, and no disk storage is needed for feature caching.
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
- Fast R-CNN architecture
- Takes an entire image and a set of object proposals as input.
- The whole image is processed by conv and max pooling layers to produce a conv feature map.
- For each object proposal, an RoI pooling layer extracts a fixed-length RoI feature vector directly from the conv feature map.
- Finally, each RoI feature vector is fed into fully connected layers that branch into two sibling output modules:
- a softmax layer for classification;
- a regressor that outputs refined bounding boxes for each class (see the sketch after this list).
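A minimal sketch of this forward pass, assuming PyTorch and torchvision's `roi_pool`; the backbone that produces `feature_map`, the names `fc6`/`fc7`, and K = 20 classes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    """Sketch of RoI pooling + fc layers + two sibling output layers."""

    def __init__(self, in_channels=512, pool_size=7, num_classes=20):
        super().__init__()
        self.pool_size = pool_size
        fc_in = in_channels * pool_size * pool_size
        self.fc6 = nn.Linear(fc_in, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        # Sibling outputs: K object classes + 1 background, and 4 box values per class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # feature_map: (N, C, H, W) conv feature map of the whole image batch.
        # rois: (R, 5) tensor of (batch_index, x1, y1, x2, y2) in image coordinates.
        x = roi_pool(feature_map, rois, (self.pool_size, self.pool_size), spatial_scale)
        x = x.flatten(1)                      # one fixed-length vector per RoI
        x = torch.relu(self.fc6(x))
        x = torch.relu(self.fc7(x))
        return self.cls_score(x), self.bbox_pred(x)  # softmax is applied in the loss / at test time
```

At test time, a softmax over the K+1 class scores gives the probability estimates, and the 4K `bbox_pred` outputs give the per-class refined boxes.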
First, the last max pooling layer is replaced by an RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
- R-CNN -> Fast R-CNN
- The last max pooling layer is replaced by an RoI pooling layer.
- The last fully connected layer and softmax are replaced by two sibling output layers.
- The network is modified to take two data inputs: a list of images and a list of RoIs in those images (see the sketch below).
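Sketched below is how these three changes might look for a torchvision VGG16; this is an assumption for illustration (the original implementation used Caffe, and K = 20 is just an example value):

```python
import torch.nn as nn
import torchvision

K = 20  # number of object classes (e.g., PASCAL VOC); an illustrative assumption

vgg = torchvision.models.vgg16()  # in practice, initialized from ImageNet-pretrained weights

# 1. Drop VGG16's last max pooling layer; RoI pooling (H = W = 7) takes its place,
#    applied per proposal (e.g., with torchvision.ops.roi_pool at spatial_scale = 1/16).
conv_body = nn.Sequential(*list(vgg.features.children())[:-1])

# 2. Keep fc6/fc7 but replace the final 1000-way ImageNet classifier with two sibling layers.
fc_trunk = nn.Sequential(*list(vgg.classifier.children())[:-1])  # fc6, fc7 (+ ReLU/Dropout)
cls_score = nn.Linear(4096, K + 1)   # softmax over K classes + background
bbox_pred = nn.Linear(4096, 4 * K)   # 4 box values per object class

# 3. The forward pass then takes two inputs, an image batch and its RoIs,
#    as in the FastRCNNHead sketch above.
```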
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer. The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
- Why SPP-Net cannot update the conv layers before the spatial pyramid pooling layer
- The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (RoI) comes from a different image.
- The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image.
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation.
- How Fast R-CNN fixes the back-propagation inefficiency
- Feature sharing: RoIs from the same image share computation and memory.
- Hierarchical sampling: sample N images, then R/N RoIs from each image (see the sketch below).
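A minimal sketch of this hierarchical sampling with the paper's example values N = 2 and R = 128; the per-image `"proposals"` field is an assumed data layout, not the paper's:

```python
import random

def sample_minibatch(dataset, N=2, R=128):
    """Hierarchical sampling: pick N images, then R/N RoIs from each image.

    `dataset` is assumed to be a list of dicts, each with a "proposals" list
    of candidate RoIs for that image (an illustrative layout).
    """
    images = random.sample(dataset, N)
    minibatch = []
    for image in images:
        # All R/N RoIs below come from the same image, so they share one
        # conv forward/backward pass over that image.
        rois = random.sample(image["proposals"], R // N)
        minibatch.append((image, rois))
    return minibatch

# With N = 2 and R = 128, each SGD minibatch touches only two images,
# instead of up to 128 different images when every RoI comes from its own image.
```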
References:
[1] R. Girshick. Fast R-CNN. In ICCV, 2015.
@qingdujun
May 25, 2018, Huairou, Beijing