Question
I'm currently trying to modify the VGG16 network architecture so that it's able to accept 400x400 px images.
Based on literature I've read, the way to do it would be to convert the fully connected (FC) layers into convolutional (CONV) layers. This would essentially "allow the network to efficiently 'slide' across a larger input image and make multiple evaluations of different parts of the image, incorporating all available contextual information." Afterwards, an average pooling layer is used to "average the multiple feature vectors into a single feature vector that summarizes the input image".
I've done this using this function, and have come up with the following network architecture:
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 64, 400, 400]           1,792
              ReLU-2         [-1, 64, 400, 400]               0
            Conv2d-3         [-1, 64, 400, 400]          36,928
              ReLU-4         [-1, 64, 400, 400]               0
         MaxPool2d-5         [-1, 64, 200, 200]               0
            Conv2d-6        [-1, 128, 200, 200]          73,856
              ReLU-7        [-1, 128, 200, 200]               0
            Conv2d-8        [-1, 128, 200, 200]         147,584
              ReLU-9        [-1, 128, 200, 200]               0
        MaxPool2d-10        [-1, 128, 100, 100]               0
           Conv2d-11        [-1, 256, 100, 100]         295,168
             ReLU-12        [-1, 256, 100, 100]               0
           Conv2d-13        [-1, 256, 100, 100]         590,080
             ReLU-14        [-1, 256, 100, 100]               0
           Conv2d-15        [-1, 256, 100, 100]         590,080
             ReLU-16        [-1, 256, 100, 100]               0
        MaxPool2d-17          [-1, 256, 50, 50]               0
           Conv2d-18          [-1, 512, 50, 50]       1,180,160
             ReLU-19          [-1, 512, 50, 50]               0
           Conv2d-20          [-1, 512, 50, 50]       2,359,808
             ReLU-21          [-1, 512, 50, 50]               0
           Conv2d-22          [-1, 512, 50, 50]       2,359,808
             ReLU-23          [-1, 512, 50, 50]               0
        MaxPool2d-24          [-1, 512, 25, 25]               0
           Conv2d-25          [-1, 512, 25, 25]       2,359,808
             ReLU-26          [-1, 512, 25, 25]               0
           Conv2d-27          [-1, 512, 25, 25]       2,359,808
             ReLU-28          [-1, 512, 25, 25]               0
           Conv2d-29          [-1, 512, 25, 25]       2,359,808
             ReLU-30          [-1, 512, 25, 25]               0
        MaxPool2d-31          [-1, 512, 12, 12]               0
           Conv2d-32           [-1, 4096, 1, 1]     301,993,984
             ReLU-33           [-1, 4096, 1, 1]               0
          Dropout-34           [-1, 4096, 1, 1]               0
           Conv2d-35           [-1, 4096, 1, 1]      16,781,312
             ReLU-36           [-1, 4096, 1, 1]               0
          Dropout-37           [-1, 4096, 1, 1]               0
           Conv2d-38              [-1, 3, 1, 1]          12,291
AdaptiveAvgPool2d-39              [-1, 3, 1, 1]               0
          Softmax-40              [-1, 3, 1, 1]               0
================================================================
Total params: 333,502,275
Trainable params: 318,787,587
Non-trainable params: 14,714,688
----------------------------------------------------------------
Input size (MB): 1.83
Forward/backward pass size (MB): 696.55
Params size (MB): 1272.21
Estimated Total Size (MB): 1970.59
----------------------------------------------------------------
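For reference, the conversion is along these lines (a minimal sketch assuming torchvision's vgg16 with a frozen feature extractor and 3 output classes; it reproduces the parameter counts above):

import torch.nn as nn
from torchvision import models

vgg = models.vgg16(pretrained=True)

# Freeze the convolutional backbone -- these are the 14,714,688
# non-trainable parameters in the summary above.
for p in vgg.features.parameters():
    p.requires_grad = False

model = nn.Sequential(
    vgg.features,                              # -> (512, 12, 12) for a 400x400 input
    nn.Conv2d(512, 4096, kernel_size=12),      # replaces Linear(25088, 4096)
    nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Conv2d(4096, 4096, kernel_size=1),      # replaces Linear(4096, 4096)
    nn.ReLU(inplace=True),
    nn.Dropout(),
    nn.Conv2d(4096, 3, kernel_size=1),         # replaces Linear(4096, 3)
    nn.AdaptiveAvgPool2d(output_size=(1, 1)),  # average the feature vectors spatially
    nn.Softmax(dim=1),
)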
My question is simple: is the average pooling layer at the end necessary? By the last convolutional layer we already get a 1x1 feature map with 3 channels, so average pooling over it would seem to have no effect.
If there is anything amiss in my logic/architecture, kindly feel free to point it out. Thanks!
Answer 1:
How to convert VGG to accept an input size of 400 x 400?
First Approach
The problem with a VGG-style architecture is that the number of input and output features in the Linear layers is hardcoded, i.e.:
vgg.classifier[0]: Linear(in_features=25088, out_features=4096, bias=True)
It is expecting 25,088 input features.
If we pass an image of size (3, 224, 224) through vgg.features, the output feature map has dimensions:
(512, 7, 7) => 512 * 7 * 7 => 25,088
If we change the input image size to (3, 400, 400) and pass it through vgg.features, the output feature map has dimensions:
(512, 12, 12) => 512 * 12 * 12 => 73,728
which throws a size mismatch error.
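You can check these shapes directly (a quick sketch against torchvision's vgg16; only vgg.features is called here, so no error is raised yet):

import torch
from torchvision import models

vgg = models.vgg16()

# The feature-map shape depends on the input size:
print(vgg.features(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
print(vgg.features(torch.zeros(1, 3, 400, 400)).shape)  # torch.Size([1, 512, 12, 12])

# Flattening (512, 12, 12) gives 73,728 features, which
# Linear(in_features=25088, ...) rejects with a size-mismatch RuntimeError.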
One way to fix this issue is to use nn.AdaptiveAvgPool2d in place of the final pooling layer (in torchvision's vgg16, features[30] is an nn.MaxPool2d). Adaptive pooling lets you fix the output size of the layer, which then remains constant irrespective of the size of the input coming through vgg.features.
For example:
vgg.features[30] = nn.AdaptiveAvgPool2d(output_size=(7, 7))
will make sure the final feature maps have a dimension of (512, 7, 7) irrespective of the input size.
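A short sketch of that fix (assuming torchvision's vgg16, where features[30] is the final MaxPool2d):

import torch
import torch.nn as nn
from torchvision import models

vgg = models.vgg16()

# Swap the last pooling layer for an adaptive one with a fixed output size,
# so Linear(in_features=25088, ...) in the classifier stays valid.
vgg.features[30] = nn.AdaptiveAvgPool2d(output_size=(7, 7))

print(vgg.features(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
print(vgg.features(torch.zeros(1, 3, 400, 400)).shape)  # torch.Size([1, 512, 7, 7])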
You can read more about adaptive pooling here.
Second Approach
If you use the technique described here to convert your Linear layers to convolutional layers, you don't have to worry about the input dimension; however, you may have to change the weight initialisation technique because the number of parameters changes.
Is the use of the average pooling layer at the end necessary?
Not in this case. It does not change the size of the input feature map, so it is not actually averaging over a set of values.
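You can verify that it is a no-op on a 1x1 map (a quick sketch):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 1, 1)
pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
print(torch.equal(pool(x), x))  # True: averaging a single value changes nothing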
Answer 2:
The purpose of AdaptiveAvgPool2d is to make the convnet work on inputs of any arbitrary size (and produce an output of fixed size). In your case, since the input size is fixed at 400x400, you probably do not need it.
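A minimal sketch of that behaviour:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
for size in (7, 12, 25):
    # Output is always (1, 512, 1, 1), whatever the spatial input size.
    print(pool(torch.randn(1, 512, size, size)).shape)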
I think this paper on spatial pyramid pooling might give you a better idea of this method: https://arxiv.org/pdf/1406.4729v3.pdf
Source: https://stackoverflow.com/questions/53114882/pytorch-modifying-vgg16-architecture