I\'m looking at InceptionV3 (GoogLeNet) architecture and cannot understand why do we need conv1x1 layers?
I know how convolution works, but I see a profit with patch siz
A 1x1 convolution simply maps in input pixel to an output pixel, not looking at anything around itself. It is often used to reduce the number of depth channels, since it is often very slow to multiply volumes with extremely large depths.
input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
input (256 depth) -> 4x4 convolution (256 depth)
The bottom one is about ~3.7x slower.
Theoretically the neural network can 'choose' which input 'colors' to look at using this, instead of brute force multiplying everything.