I have read the documentation about the group param:
group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a su
First of all, Caffe only defines the behavior when both input_channel and output_channel are multiples of group. We can confirm this from the source code:
CHECK_EQ(channels_ % group_, 0);
CHECK_EQ(num_output_ % group_, 0)
<< "Number of output should be multiples of group.";
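The two checks above can be restated as a small predicate. This is only a minimal sketch in Python, not Caffe code; the function name group_is_valid and its arguments are hypothetical and simply mirror channels_, num_output_ and group_.

```python
def group_is_valid(in_channels: int, out_channels: int, group: int) -> bool:
    """Return True when both channel counts are multiples of group.

    Hypothetical helper mirroring the two CHECK_EQ conditions:
    channels_ % group_ == 0 and num_output_ % group_ == 0.
    """
    return in_channels % group == 0 and out_channels % group == 0


if __name__ == "__main__":
    # 40 input channels with group = 20, for a few candidate output sizes.
    for out_channels in (20, 40, 50):
        print(out_channels, group_is_valid(40, out_channels, 20))
```

With 40 input channels and group = 20, only output sizes that are multiples of 20 pass the check.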
Secondly, the parameter group affects the number of filter parameters, specifically the channel size of each filter. Each filter actually has input_channel / group channels. This can also be confirmed from the source code:
vector<int> weight_shape(2);
weight_shape[0] = conv_out_channels_;
weight_shape[1] = conv_in_channels_ / group_;
Note that weight_shape[0] is the number of filters.
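To see what this weight shape implies, here is a rough Python sketch, not Caffe code. It assumes a plain Convolution layer, a hypothetical 3x3 kernel, and that groups occupy contiguous channel ranges; the helper names weight_shape and input_channels_for are made up for illustration.

```python
def weight_shape(in_channels, out_channels, group, kernel_h, kernel_w):
    """Filter blob shape: (number of filters, channels per filter, kh, kw)."""
    return (out_channels, in_channels // group, kernel_h, kernel_w)


def input_channels_for(out_index, in_channels, out_channels, group):
    """Input channels that feed one output channel.

    Assumes group g covers a contiguous block of input and output channels.
    """
    in_per_group = in_channels // group
    out_per_group = out_channels // group
    g = out_index // out_per_group  # group that owns this output channel
    return list(range(g * in_per_group, (g + 1) * in_per_group))


if __name__ == "__main__":
    # 40 input channels, 20 filters, group = 20, hypothetical 3x3 kernel.
    print(weight_shape(40, 20, 20, 3, 3))
    print(input_channels_for(0, 40, 20, 20))
    # Depthwise case: group = input channels = output channels.
    print(input_channels_for(5, 40, 40, 40))
```

The second dimension of the shape is the reason group reduces the parameter count: every filter only spans the input channels of its own group.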
In Caffe, if input_channel is 40 and group is 20:
output_channel cannot be 50, because 50 is not a multiple of 20.
If output_channel is 20 (remember, this means you have 20 filters), every 2 input channels are responsible for one output channel. For example, the 0th output channel is computed from the 0th and 1st input channels and does not depend on any other input channel.
If output_channel equals input_channel (i.e. output_channel = 40), each group maps 2 input channels to 2 output channels. The well-known depthwise convolution is the case where group also equals input_channel (here group = 40), so that each output channel is computed from exactly one distinct input channel.

We almost always set group = output_channel. Here is the suggested config for the Deconvolution layer from the official doc:
layer {
name: "upsample", type: "Deconvolution"
bottom: "{{bottom_name}}" top: "{{top_name}}"
convolution_param {
kernel_size: {{2 * factor - factor % 2}} stride: {{factor}}
num_output: {{C}} group: {{C}}
pad: {{ceil((factor - 1) / 2.)}}
weight_filler: { type: "bilinear" } bias_term: false
}
param { lr_mult: 0 decay_mult: 0 }
}
with the following explanation:
By specifying num_output: {{C}} group: {{C}}, it behaves as channel-wise convolution. The filter shape of this deconvolution layer will be (C, 1, K, K) where K is kernel_size, and this filler will set a (K, K) interpolation kernel for every channel of the filter identically. The resulting shape of the top feature map will be (B, C, factor * H, factor * W). Note that the learning rate and the weight decay are set to 0 in order to keep coefficient values of bilinear interpolation unchanged during training.
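The shape claims in this explanation can be checked with a little arithmetic. The following Python sketch is not Caffe code; it evaluates the template expressions from the config and assumes the usual transposed-convolution output size, stride * (in - 1) + kernel_size - 2 * pad, with no dilation. The function names and the example sizes (C = 3, H = W = 16) are hypothetical.

```python
import math


def bilinear_upsample_params(factor):
    """Evaluate the template expressions from the config for one factor."""
    kernel_size = 2 * factor - factor % 2
    stride = factor
    pad = math.ceil((factor - 1) / 2.0)
    return kernel_size, stride, pad


def deconv_output_size(in_size, kernel_size, stride, pad):
    """Output size of a transposed convolution along one axis (assumed formula)."""
    return stride * (in_size - 1) + kernel_size - 2 * pad


if __name__ == "__main__":
    channels, height, width = 3, 16, 16  # hypothetical C, H, W
    for factor in (2, 3, 4):
        k, s, p = bilinear_upsample_params(factor)
        filter_shape = (channels, 1, k, k)  # group = C leaves 1 channel per filter
        out_h = deconv_output_size(height, k, s, p)
        out_w = deconv_output_size(width, k, s, p)
        print(factor, filter_shape, (channels, out_h, out_w))
```

Under these assumptions the output height and width come out as factor * H and factor * W for each factor tried, which matches the shape stated in the doc.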