R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling.
- R-CNN -> SPP-Net
- R-CNN is slow because it performs a ConvNet forward pass for each region proposal without sharing computation.
- SPP-Net computes a convolutional feature map for the entire input image, then classifies each region proposal using a feature vector extracted from the shared feature map (see the pooling sketch below).
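To make the pooling step concrete, here is a minimal sketch assuming PyTorch (the paper does not prescribe a framework, and the proposal is assumed to already be projected into feature-map coordinates): the region of the shared feature map inside one proposal is max-pooled into several fixed sizes, which are concatenated as in spatial pyramid pooling.

```python
import torch
import torch.nn.functional as F

def spp_pool(feature_map, proposal, output_sizes=(6, 3, 2, 1)):
    """Pool the feature-map region inside one proposal into fixed-size outputs.

    feature_map: (C, H, W) shared conv feature map of the whole image.
    proposal:    (x1, y1, x2, y2) box, assumed already in feature-map coordinates.
    """
    x1, y1, x2, y2 = proposal
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1].unsqueeze(0)  # (1, C, h, w)
    pooled = []
    for s in output_sizes:
        # Adaptive max pooling yields an s x s grid regardless of the region's size.
        pooled.append(F.adaptive_max_pool2d(region, s).flatten(1))
    # Concatenate the multi-level outputs into one fixed-length vector.
    return torch.cat(pooled, dim=1)  # (1, C * sum(s*s for s in output_sizes))

# Example: a 512-channel feature map and one proposal box.
fmap = torch.randn(512, 40, 60)
vec = spp_pool(fmap, (5, 5, 30, 25))
print(vec.shape)  # torch.Size([1, 25600]) = 512 * (36 + 9 + 4 + 1)
```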
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed for SPPnet cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this restriction (fixed convolutional layers) limits the accuracy of very deep networks.
- SPP-Net -> Fast R-CNN
- SPP-Net training is a multi-stage pipeline, and features must be written to disk.
- SPP-Net cannot update the convolutional layers that precede the spatial pyramid pooling layer.
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:
- Higher detection quality (mAP) than R-CNN, SPPnet
- Training is single-stage, using a multi-task loss
- Training can update all network layers
- No disk storage is required for feature caching
Fast R-CNN summary
- Fast R-CNN training is single-stage and uses a multi-task loss.
- All network layers can be updated, and no disk storage is needed for feature caching.
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
- Fast R-CNN architecture
- Takes an entire image and a set of object proposals as input.
- The whole image is processed by conv and max pooling layers to produce a conv feature map.
- For each object proposal, an RoI pooling layer extracts a fixed-length RoI feature vector directly from the conv feature map.
- Finally, each RoI feature vector is fed into fully connected layers that branch into two sibling output modules:
- a softmax layer for classification;
- a regressor that outputs refined bounding boxes for each class (see the sketch after this list).
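A minimal sketch of this forward pass, assuming PyTorch and torchvision's `roi_pool`; the backbone that produces `feature_map`, the names `fc6`/`fc7`, and K = 20 classes are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    """Sketch of RoI pooling + fc layers + two sibling output layers."""

    def __init__(self, in_channels=512, pool_size=7, num_classes=20):
        super().__init__()
        self.pool_size = pool_size
        fc_in = in_channels * pool_size * pool_size
        self.fc6 = nn.Linear(fc_in, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        # Sibling outputs: K object classes + 1 background, and 4 box values per class.
        self.cls_score = nn.Linear(4096, num_classes + 1)
        self.bbox_pred = nn.Linear(4096, 4 * num_classes)

    def forward(self, feature_map, rois, spatial_scale=1.0 / 16):
        # feature_map: (N, C, H, W) conv feature map of the whole image batch.
        # rois: (R, 5) tensor of (batch_index, x1, y1, x2, y2) in image coordinates.
        x = roi_pool(feature_map, rois, (self.pool_size, self.pool_size), spatial_scale)
        x = x.flatten(1)                      # one fixed-length vector per RoI
        x = torch.relu(self.fc6(x))
        x = torch.relu(self.fc7(x))
        return self.cls_score(x), self.bbox_pred(x)  # softmax is applied in the loss / at test time
```

At test time, a softmax over the K+1 class scores gives the probability estimates, and the 4K `bbox_pred` outputs give the per-class refined boxes.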
First, the last max pooling layer is replaced by an RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
- R-CNN -> Fast R-CNN
- The last max pooling layer is replaced by an RoI pooling layer.
- The last fully connected layer and softmax are replaced by two sibling output layers.
- The network is modified to take two data inputs: a list of images and a list of RoIs in those images (see the sketch below).
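Sketched below is how these three changes might look for a torchvision VGG16; this is an assumption for illustration (the original implementation used Caffe, and K = 20 is just an example value):

```python
import torch.nn as nn
import torchvision

K = 20  # number of object classes (e.g., PASCAL VOC); an illustrative assumption

vgg = torchvision.models.vgg16()  # in practice, initialized from ImageNet-pretrained weights

# 1. Drop VGG16's last max pooling layer; RoI pooling (H = W = 7) takes its place,
#    applied per proposal (e.g., with torchvision.ops.roi_pool at spatial_scale = 1/16).
conv_body = nn.Sequential(*list(vgg.features.children())[:-1])

# 2. Keep fc6/fc7 but replace the final 1000-way ImageNet classifier with two sibling layers.
fc_trunk = nn.Sequential(*list(vgg.classifier.children())[:-1])  # fc6, fc7 (+ ReLU/Dropout)
cls_score = nn.Linear(4096, K + 1)   # softmax over K classes + background
bbox_pred = nn.Linear(4096, 4 * K)   # 4 box values per object class

# 3. The forward pass then takes two inputs, an image batch and its RoIs,
#    as in the FastRCNNHead sketch above.
```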
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer. The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
- Why SPP-Net cannot update the conv layers before the spatial pyramid pooling layer
- The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (RoI) comes from a different image.
- The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image.
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation.
- How Fast R-CNN fixes the back-propagation inefficiency
- Feature sharing: RoIs from the same image share computation and memory.
- Hierarchical sampling: sample N images, then R/N RoIs from each image (see the sketch below).
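A minimal sketch of this hierarchical sampling with the paper's example values N = 2 and R = 128; the per-image `"proposals"` field is an assumed data layout, not the paper's:

```python
import random

def sample_minibatch(dataset, N=2, R=128):
    """Hierarchical sampling: pick N images, then R/N RoIs from each image.

    `dataset` is assumed to be a list of dicts, each with a "proposals" list
    of candidate RoIs for that image (an illustrative layout).
    """
    images = random.sample(dataset, N)
    minibatch = []
    for image in images:
        # All R/N RoIs below come from the same image, so they share one
        # conv forward/backward pass over that image.
        rois = random.sample(image["proposals"], R // N)
        minibatch.append((image, rois))
    return minibatch

# With N = 2 and R = 128, each SGD minibatch touches only two images,
# instead of up to 128 different images when every RoI comes from its own image.
```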
References:
[1] R. Girshick. Fast R-CNN. In ICCV, 2015.
@qingdujun
May 25, 2018, Huairou, Beijing