I am reading through the Residual Learning paper, and I have a question. What is the "linear projection" mentioned in Section 3.2? It probably looks simple once you get it, but I could not get th
A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features, each of which is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
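To make the matrix form concrete, here is a minimal NumPy sketch. The sizes (N = 3, M = 2) and the weights in W are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical sizes for illustration: N = 3 input features, M = 2 new features.
x = np.array([1.0, 2.0, 3.0])        # vector of N input features
W = np.array([[1.0, 0.0, -1.0],      # row 0: weights of the first projection
              [0.5, 0.5, 0.5]])      # row 1: weights of the second projection

y = W @ x                            # M new features, one per row of W

# The same result, written out as one weighted sum per row of W.
manual = np.array([np.dot(W[0], x), np.dot(W[1], x)])
```

Both `y` and `manual` hold the same two values, since each entry of Wx is just the dot product of one row of W with x.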
First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.
x is the input data (called a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv + batchnorm + relu in this paper), and y combines the two by element-wise addition, forming the output of the block. The result of F is also of rank 4, and most of its dimensions will be the same as in x, except for one. That mismatch is exactly what the projection has to patch.
For example, the shape of x might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. The shape of F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, assume the ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters, here 16.
So, in order for y = F(x) + x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).
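A quick way to see why the addition is invalid as it stands is to try it in NumPy. In this sketch `Fx` is only a zero-filled stand-in with the shape of F(x); a real F would be a conv layer.

```python
import numpy as np

x = np.zeros((64, 32, 32, 3))      # input: batch, height, width, channels
Fx = np.zeros((64, 32, 32, 16))    # stand-in for F(x), same layout but 16 channels

try:
    y = Fx + x
    add_failed = False
except ValueError:
    # NumPy cannot broadcast a trailing dimension of 3 against 16.
    add_failed = True
```

The addition raises a `ValueError`, because the last axes (3 and 16) neither match nor can be broadcast.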
I'd like to stress that "reshaping" here is not what numpy.reshape does.
Instead, the channel axis of x (axis 3, the last one) is padded with 13 zeros, like this:
pad(x=[1, 2, 3],padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]
If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we treat the vector as the same one, but living in a space with 13 more dimensions. None of the other dimensions of x are changed.
Here's the link to the code in Tensorflow that does this.
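As a rough sketch of the padding idea (this is not the linked TensorFlow code), the same 7/6 split from the pad example can be reproduced with `np.pad`. The matrix `P` is a name introduced here only to show that zero-padding is itself a linear projection Wx with a particular fixed W.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(64, 32, 32, 3))

# Pad only the last (channel) axis: 7 zeros before and 6 after, 13 in total.
x_padded = np.pad(x, pad_width=((0, 0), (0, 0), (0, 0), (7, 6)))

# The same operation as a linear projection: a 16-by-3 matrix P
# whose rows 7..9 form an identity block and whose other rows are zero.
P = np.zeros((16, 3))
P[7:10, :] = np.eye(3)
x_projected = x @ P.T              # applies P to the channel vector at every pixel

Fx = np.zeros((64, 32, 32, 16))    # stand-in for F(x)
y = Fx + x_padded                  # now a valid element-wise addition
```

Here `x_padded` and `x_projected` are equal, and `y` has shape (64, 32, 32, 16). Unlike `numpy.reshape`, which would require the total number of elements to stay the same, padding adds new zero-valued entries.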
In PyTorch (in particular torchvision\models\resnet.py), at the end of a Bottleneck you will have one of two scenarios:
1. The number of channels of the input x, say x_c (channels, not spatial resolution), does not match (is typically less than) the number of channels output by layer conv3 of the Bottleneck, say d. This is resolved by a 1x1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization. The addition F(x) + x then takes place, assuming x and F(x) have the same spatial resolution.
2. Neither the spatial resolution of x nor its number of channels matches the output of the Bottleneck, in which case the 1x1 convolution mentioned above needs to have stride 2 so that both the spatial resolution and the number of channels match for the element-wise addition (again with batch normalization of the projected x before the addition).
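The two scenarios can be sketched in plain NumPy to show that a 1x1 convolution is the linear projection Wx applied to the channel vector at every pixel. This is not the torchvision implementation: `conv1x1` below is a hand-written stand-in, batch normalization is left out, and the sizes (x_c = 64, d = 256, 8x8 feature maps) are assumed for illustration. The layout is (batch, channels, height, width), as PyTorch uses.

```python
import numpy as np

def conv1x1(x, weight, stride=1):
    """1x1 convolution on an NCHW array; weight has shape (out_channels, in_channels)."""
    # With a 1x1 kernel, a stride of s simply keeps every s-th pixel.
    x = x[:, :, ::stride, ::stride]
    # Apply the same channel-mixing matrix at every remaining pixel.
    return np.einsum("oc,nchw->nohw", weight, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64, 8, 8))     # hypothetical block input with x_c = 64 channels
W = rng.normal(size=(256, 64))         # hypothetical projection to d = 256 channels

same_res = conv1x1(x, W, stride=1)     # scenario 1: shape (2, 256, 8, 8)
half_res = conv1x1(x, W, stride=2)     # scenario 2: shape (2, 256, 4, 4)
```

In both cases the result has the shape needed for the element-wise addition with F(x): stride 1 changes only the channel count, while stride 2 also halves the height and width.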