In the paper 'Fully Convolutional Networks for Semantic Segmentation' the author distinguishes between input stride and output stride in the context of deconvolution. How do
Input stride is the stride of the filter, i.e. how much you shift the filter in the output.
Output stride is actually a nominal value. In a CNN we get a feature map after several convolution and max-pooling operations. Let's say our input image is 224 x 224 and our final feature map is 7 x 7.
Then we say our output stride is 224 / 7 = 32 (an approximation of how much the image has been downsampled).
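The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this answer, and the list of five stride-2 layers is an assumed VGG-style layout, not something taken from the paper.

```python
import math


def nominal_output_stride(input_size, feature_size):
    """Ratio of the input resolution to the final feature-map resolution."""
    return input_size / feature_size


def stride_from_layers(layer_strides):
    """Overall downsampling factor: the product of the per-layer strides."""
    return math.prod(layer_strides)


# 224 x 224 input, 7 x 7 final feature map.
print(nominal_output_stride(224, 7))        # 32.0

# Assumed example: five downsampling layers, each with stride 2.
print(stride_from_layers([2, 2, 2, 2, 2]))  # 32
```

Both views give the same number here: dividing the input size by the feature-map size, or multiplying the strides of the layers that downsample.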
This TensorFlow script describes what output stride is and how to use it in an FCN, which is the dense prediction case:
one uses inputs with spatial dimensions that are multiples of 32 plus 1, e.g., [321, 321]. In this case the feature maps at the ResNet output will have spatial shape [(height - 1) / output_stride + 1, (width - 1) / output_stride + 1] and corners exactly aligned with the input image corners, which greatly facilitates alignment of the features to the image. Using as input [225, 225] images results in [8, 8] feature maps at the output of the last ResNet block.
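To check the shape formula from the quoted text, here is a minimal sketch. The helper name aligned_feature_shape is hypothetical and not part of TensorFlow; it just evaluates [(height - 1) / output_stride + 1, (width - 1) / output_stride + 1] with integer division.

```python
def aligned_feature_shape(height, width, output_stride=32):
    """Feature-map shape for inputs whose sides are a multiple of the stride plus 1."""
    return [(height - 1) // output_stride + 1,
            (width - 1) // output_stride + 1]


print(aligned_feature_shape(225, 225))  # [8, 8]
print(aligned_feature_shape(321, 321))  # [11, 11]
```

The [225, 225] case reproduces the [8, 8] feature map mentioned in the quote.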