How does input image size influence size and shape of fully connected layer?

遇见更好的自我 2021-01-28 04:50

I am reading a lot of tutorials that state two things.

  1. "[Replacing fully connected layers with convolutional layers] casts them into fully convolutional networks
2 Answers
  •  陌清茗 (original poster)
    2021-01-28 05:13

    It seems like you are confusing the spatial dimensions (height and width) of an image/feature map with the "channel dimension", which is the size of the information stored per pixel.

    An input image can have arbitrary height and width, but it always has a fixed "channel" dimension of 3: each pixel holds three values, the RGB components of its color.
    Let's denote the input shape as 3xHxW (3 RGB channels by height H by width W).

    Applying a convolution with kernel_size=5 and output_channels=64 means that you have 64 filters of size 3x5x5. For each filter, you take every overlapping 3x5x5 window in the image (RGB by 5 by 5 pixels) and output a single number, the weighted sum of the input values in that window. Doing so for all 64 filters gives you 64 channels per sliding window, that is, an output feature map of shape 64x(H-4)x(W-4) (assuming stride 1 and no padding).
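The sliding-window weighted sum described above can be sketched in plain Python. This is a minimal illustration, not how a deep learning library implements it: the function names `conv2d` and `shape` are made up for this example, stride 1 and no padding are assumed, and only 4 filters on a tiny 3x8x9 input are used (instead of 64 filters on a real image) to keep it fast.

```python
def conv2d(image, filters):
    """Naive convolution on nested lists (assumes stride 1, no padding).

    image:   C x H x W
    filters: O x C x k x k
    returns: O x (H-k+1) x (W-k+1)
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    output = []
    for filt in filters:  # one output channel per filter
        channel_map = []
        for top in range(out_h):
            row = []
            for left in range(out_w):
                # Weighted sum over the C x k x k window whose corner is (top, left)
                total = 0.0
                for c in range(channels):
                    for i in range(k):
                        for j in range(k):
                            total += filt[c][i][j] * image[c][top + i][left + j]
                row.append(total)
            channel_map.append(row)
        output.append(channel_map)
    return output


def shape(tensor):
    """Shape of a C x H x W nested list."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Toy data: a 3x8x9 "image" of ones and 4 filters of size 3x5x5, also all ones.
image = [[[1.0] * 9 for _ in range(8)] for _ in range(3)]
filters = [[[[1.0] * 5 for _ in range(5)] for _ in range(3)] for _ in range(4)]

features = conv2d(image, filters)
print(shape(features))    # (4, 4, 5): 4 channels, (8-4) x (9-4) positions
print(features[0][0][0])  # 75.0 = 3 * 5 * 5 ones summed
```

The number of output channels depends only on the number of filters, while the output height and width follow the input height and width.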

    An additional convolution layer with, say, kernel_size=3 and output_channels=128 will have 128 filters of shape 64x3x3, applied to all 3x3 sliding windows in the input feature map of shape 64x(H-4)x(W-4), resulting in an output feature map of shape 128x(H-6)x(W-6).
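The shape bookkeeping for these two layers can be checked with a few lines of arithmetic. The helpers below (`conv_output_shape`, `conv_weight_shape`) are hypothetical names for this sketch; the `stride` and `padding` parameters are not part of the answer and default to the stride-1, no-padding case it describes.

```python
def conv_output_shape(in_shape, out_channels, kernel_size, stride=1, padding=0):
    """Output shape (C, H, W) of a convolution applied to a (C, H, W) input."""
    _, height, width = in_shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (out_channels, out_h, out_w)


def conv_weight_shape(in_shape, out_channels, kernel_size):
    """Filter bank shape: one (in_channels x k x k) filter per output channel."""
    return (out_channels, in_shape[0], kernel_size, kernel_size)


# The same two layers applied to two different input sizes.
for height, width in [(32, 32), (100, 60)]:
    x = (3, height, width)
    y1 = conv_output_shape(x, out_channels=64, kernel_size=5)
    y2 = conv_output_shape(y1, out_channels=128, kernel_size=3)
    print(x, "->", y1, "->", y2)

print(conv_weight_shape((3, 32, 32), 64, 5))    # (64, 3, 5, 5)
print(conv_weight_shape((64, 28, 28), 128, 3))  # (128, 64, 3, 3)
```

Note that the filter shapes do not involve H or W at all, which is why the same weights work for any input size.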

    You can continue in a similar way with additional convolution and even pooling layers.
    This post has a very good explanation on how convolution/pooling layers affect the shapes of the feature maps.

    To recap, as long as you do not change the number of input channels, you can apply a fully convolutional net to images of arbitrary spatial dimensions. The output feature maps will have different spatial shapes, but always the same number of channels.

    As for a fully connected (aka inner-product/linear) layer: this layer makes no distinction between spatial dimensions and channel dimensions. The input to a fully connected layer is "flattened", and the number of weights is then determined by the number of input elements (channel and spatial combined) and the number of outputs.
    For instance, in a VGG network trained on 3x224x224 images, the convolutional part of the network (the last convolution layer followed by the final pooling layer) outputs a feature map of shape 512x7x7, which is then flattened to a 25,088-dimensional vector and fed into a fully connected layer with 4,096 outputs.
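As a quick check of these numbers, the size of the flattened vector and the resulting parameter count can be computed directly. `linear_param_count` is a hypothetical helper for this sketch, and the bias term (one per output) is an assumption that the answer does not mention.

```python
from math import prod


def linear_param_count(in_shape, out_features, bias=True):
    """Return (flattened input size, parameter count) of a fully connected layer."""
    in_features = prod(in_shape)  # channels * height * width
    weights = in_features * out_features
    return in_features, weights + (out_features if bias else 0)


in_features, params = linear_param_count((512, 7, 7), 4096)
print(in_features)  # 25088
print(params)       # 102764544 (25088 * 4096 weights + 4096 biases)
```

The weight matrix is sized for exactly 25,088 inputs, so the spatial size 7x7 is baked into the layer.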

    If you were to feed VGG with input images of different spatial dimensions, say 3x256x256, the convolutional part of the network will output a feature map of shape 512x8x8. Note how the channel dimension, 512, did not change, but the spatial dimensions grew from 7x7 to 8x8. Now, if you were to "flatten" this feature map, you would have a 32,768-dimensional input vector for your fully connected layer, but alas, your fully connected layer expects a 25,088-dimensional input: you will get a RuntimeError.
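The mismatch can be reproduced without any framework. In the sketch below, `FixedSizeLinear` is a hypothetical stand-in that only tracks sizes (it is not a real library class), and `vgg_feature_shape` assumes the usual VGG layout of size-preserving convolutions and five 2x2, stride-2 pooling layers, a detail that is not spelled out in the answer.

```python
from math import prod


class FixedSizeLinear:
    """Hypothetical size-only model of a fully connected layer."""

    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

    def check(self, feature_map_shape):
        flat = prod(feature_map_shape)  # size after flattening
        if flat != self.in_features:
            raise RuntimeError(
                f"expected {self.in_features} inputs, got {flat}"
            )
        return (self.out_features,)


def vgg_feature_shape(height, width, channels=512, num_pools=5):
    """Feature map shape, assuming each pooling layer halves height and width."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return (channels, height, width)


fc = FixedSizeLinear(512 * 7 * 7, 4096)

print(vgg_feature_shape(224, 224))            # (512, 7, 7)
print(fc.check(vgg_feature_shape(224, 224)))  # (4096,)

print(vgg_feature_shape(256, 256))            # (512, 8, 8)
try:
    fc.check(vgg_feature_shape(256, 256))
except RuntimeError as err:
    print("RuntimeError:", err)  # expected 25088 inputs, got 32768
```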

    If you were to convert your fully connected layer to a convolutional layer with kernel_size=7 and output_channels=4096, it would perform exactly the same mathematical operation on the 512x7x7 input feature map and produce a 4096x1x1 output feature map.
    However, when you feed it a 512x8x8 feature map, it will not produce an error, but rather a 4096x2x2 output feature map: spatial dimensions adjusted, number of channels fixed.
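The equivalence can be verified numerically on a scaled-down version of the same setup. The sketch below uses toy sizes (2 channels, kernel size 3, 4 outputs in place of 512, 7 and 4096) and integer weights so that the comparison is exact; all function names are made up for this illustration, and the row-major reshape in `fc_to_conv_filters` assumes the feature map is flattened in channel, row, column order.

```python
def flatten(x):
    """Flatten a C x H x W nested list in channel, row, column order."""
    return [v for channel in x for row in channel for v in row]


def fully_connected(weights, x_flat):
    """One dot product per output row (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x_flat)) for row in weights]


def fc_to_conv_filters(weights, channels, k):
    """Reshape each weight row of length channels*k*k into a C x k x k filter."""
    return [
        [[row[(c * k + i) * k:(c * k + i) * k + k] for i in range(k)]
         for c in range(channels)]
        for row in weights
    ]


def conv2d(x, filters):
    """Naive convolution on nested lists (assumes stride 1, no padding)."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0][0])
    return [
        [[sum(filt[c][i][j] * x[c][top + i][left + j]
              for c in range(channels) for i in range(k) for j in range(k))
          for left in range(width - k + 1)]
         for top in range(height - k + 1)]
        for filt in filters
    ]


C, K, OUT = 2, 3, 4  # toy stand-ins for 512, 7 and 4096
fc_weights = [[(o + 1) * (n % 5) - 3 for n in range(C * K * K)] for o in range(OUT)]
filters = fc_to_conv_filters(fc_weights, C, K)


def make_input(size):
    """Deterministic C x size x size test input."""
    return [[[c * 10 + i * size + j for j in range(size)] for i in range(size)]
            for c in range(C)]


# On a K x K input, the convolution reproduces the fully connected layer.
small = make_input(K)
fc_out = fully_connected(fc_weights, flatten(small))
conv_out = conv2d(small, filters)
same = fc_out == [conv_out[o][0][0] for o in range(OUT)]
print(same)  # True

# On a (K+1) x (K+1) input, the convolution yields an OUT x 2 x 2 map.
conv_large = conv2d(make_input(K + 1), filters)
large_shape = (len(conv_large), len(conv_large[0]), len(conv_large[0][0]))
print(large_shape)  # (4, 2, 2)
```

Each of the 2x2 output positions is the original fully connected layer applied to one KxK crop of the larger feature map.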
