I have read the documentation about the group param:
group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a subset of the input.
First of all, Caffe only defines the behavior when both input_channel and output_channel are multiples of group. We can confirm this from the source code:
CHECK_EQ(channels_ % group_, 0);
CHECK_EQ(num_output_ % group_, 0)
<< "Number of output should be multiples of group.";
Secondly, the parameter group determines the number of filter parameters, specifically the channel depth of each filter: each filter has input_channel / group channels rather than input_channel. This can also be confirmed from the source code:
vector<int> weight_shape(2);
weight_shape[0] = conv_out_channels_;
weight_shape[1] = conv_in_channels_ / group_;
Note here that weight_shape[0] is the number of filters.
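To make the arithmetic concrete, here is a minimal Python sketch (not Caffe code; the function name is invented for illustration) that reproduces the two checks above and returns the resulting weight blob shape:

def grouped_conv_weight_shape(input_channel, output_channel, group, kernel_size):
    # Mirror Caffe's divisibility checks.
    assert input_channel % group == 0, "channels must be a multiple of group"
    assert output_channel % group == 0, "num_output must be a multiple of group"
    # Weight blob shape: (number of filters, channels per filter, kH, kW).
    return (output_channel, input_channel // group, kernel_size, kernel_size)

print(grouped_conv_weight_shape(40, 20, 20, 3))   # (20, 2, 3, 3): each filter sees only 2 channels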
For example, in Caffe, if the input_channel is 40 and the group is 20:
- output_channel may not be 50, because it must also be a multiple of group.
- If output_channel is 20 (remember, that means you have 20 filters), every 2 input channels are responsible for one output channel. For example, the 0th output channel is computed from the 0th and 1st input channels only, and has no connection to the other input channels.
- If output_channel equals input_channel (i.e. output_channel = 40), this is the well-known depthwise convolution: each output channel is computed from exactly one input channel. For depthwise convolution we almost always set group = output_channels.
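To see this connectivity in code, here is a deliberately naive numpy sketch of a grouped convolution (stride 1, no padding, names invented; this is for illustration only and is not how Caffe implements it):

import numpy as np

def grouped_conv2d(x, weights, group):
    # x: (C_in, H, W); weights: (C_out, C_in // group, kH, kW).
    c_in, h, w = x.shape
    c_out, c_per_group, kh, kw = weights.shape
    assert c_in % group == 0 and c_out % group == 0
    assert c_per_group == c_in // group
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((c_out, out_h, out_w))
    out_per_group = c_out // group
    for o in range(c_out):
        g = o // out_per_group                              # group of this output channel
        x_g = x[g * c_per_group:(g + 1) * c_per_group]      # only that group's input channels
        for i in range(out_h):
            for j in range(out_w):
                out[o, i, j] = np.sum(x_g[:, i:i + kh, j:j + kw] * weights[o])
    return out

x = np.random.randn(40, 8, 8)

# input_channel = 40, group = 20, output_channel = 20: each filter has 40 / 20 = 2 channels
print(grouped_conv2d(x, np.random.randn(20, 2, 3, 3), group=20).shape)   # (20, 6, 6)

# Depthwise case: group = input_channel = output_channel = 40, one channel per filter
print(grouped_conv2d(x, np.random.randn(40, 1, 3, 3), group=40).shape)   # (40, 6, 6)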
Here is the suggested config for the Deconvolution layer from the official documentation:
layer {
  name: "upsample", type: "Deconvolution"
  bottom: "{{bottom_name}}" top: "{{top_name}}"
  convolution_param {
    kernel_size: {{2 * factor - factor % 2}} stride: {{factor}}
    num_output: {{C}} group: {{C}}
    pad: {{ceil((factor - 1) / 2.)}}
    weight_filler: { type: "bilinear" } bias_term: false
  }
  param { lr_mult: 0 decay_mult: 0 }
}
together with the following instructions:
By specifying num_output: {{C}} group: {{C}}, it behaves as channel-wise convolution. The filter shape of this deconvolution layer will be (C, 1, K, K) where K is kernel_size, and this filler will set a (K, K) interpolation kernel for every channel of the filter identically. The resulting shape of the top feature map will be (B, C, factor * H, factor * W). Note that the learning rate and the weight decay are set to 0 in order to keep coefficient values of bilinear interpolation unchanged during training.
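The expressions in that config are plain arithmetic on the upsampling factor, and the bilinear filler just writes a fixed interpolation kernel into every channel. Here is a small Python sketch that evaluates them; the kernel construction follows the common FCN-style formula and is assumed (not taken from Caffe's filler code) to match what the "bilinear" filler produces:

import math
import numpy as np

def upsample_hyperparams(factor):
    # Evaluate the expressions from the config above for an integer factor.
    kernel_size = 2 * factor - factor % 2
    stride = factor
    pad = int(math.ceil((factor - 1) / 2.0))
    return kernel_size, stride, pad

def bilinear_kernel(kernel_size):
    # (K, K) bilinear interpolation kernel, FCN-style construction.
    f = (kernel_size + 1) // 2
    center = f - 1 if kernel_size % 2 == 1 else f - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    return (1 - abs(og[0] - center) / f) * (1 - abs(og[1] - center) / f)

k, s, p = upsample_hyperparams(2)
print(k, s, p)             # 4 2 1 -> kernel_size: 4, stride: 2, pad: 1 for 2x upsampling
print(bilinear_kernel(k))  # the (4, 4) kernel copied into every channel of the (C, 1, 4, 4) filter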
The argument gives the quantity of groups, not their size. If you have 40 inputs and set g to 20, you get 20 "lanes" of 2 input channels each. Note that the number of outputs must also be a multiple of g (see the check quoted above), so 50 outputs would not be accepted with g = 20.
More often, you split into a small number of groups, such as 2. In that case you have two processing "lanes" or groups. For the 40 => 50 layer you mention, each group would have 20 inputs and 25 outputs. Each such layer is split in half, and forward and backward propagation each work only within their own half, for the range of layers over which the group parameter applies (I think it's all the way to the final layer).
The processing advantage is that instead of 40 x 50 input-to-output connections, you have 2 groups of 20 x 25 connections, or half as many. This accelerates the processing by roughly 2x, with a very small loss in convergence progress.
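A quick way to check that arithmetic (connections per output position; the function name is invented for this sketch):

def connections_per_position(c_in, c_out, group):
    # Input-to-output connections per output pixel; the kernel area would
    # multiply both counts equally, so it is ignored here.
    return group * (c_in // group) * (c_out // group)

print(connections_per_position(40, 50, 1))   # 2000 (ungrouped 40 -> 50 layer)
print(connections_per_position(40, 50, 2))   # 1000 (two lanes of 20 -> 25: half as many)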
And secondly, why would I use [grouping]?
This was originally presented as an optimization in the paper that sparked the current cycle of neural network popularity:
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
Figure 2 of that paper shows how grouping was used. The authors of Caffe originally added this capability so they could replicate the AlexNet architecture. However, grouping continues to prove beneficial in other scenarios.
For example, both Facebook and Google have released papers which essentially show that grouping can dramatically reduce resource use while helping to preserve accuracy: the Facebook paper is ResNeXt and the Google paper is MobileNets.