Adaptive pooling is a great function, but how does it work? It seems to be inserting pads or shrinking/expanding kernel sizes in what seems like a patterned but fairly arbitrary way.
In general, pooling reduces dimensions. If you want to increase dimensions, you might want to look at interpolation.
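For example, here is a minimal sketch (assuming a 1-D signal in the usual (N, C, L) layout) that grows a tensor using torch.nn.functional.interpolate:

import torch
import torch.nn.functional as F

x = torch.arange(0, 3).view(1, 1, -1).float()  # shape (1, 1, 3)
# linearly upsample the length-3 signal to length 5
y = F.interpolate(x, size=5, mode='linear', align_corners=False)
print(y.shape)  # torch.Size([1, 1, 5])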
Anyway, let's talk about adaptive pooling in general. You can look at the source code here. Some have claimed that adaptive pooling is the same as standard pooling with the stride and kernel size calculated from the input and output size. Specifically, the following parameters are used:
stride = input_size // output_size
kernel_size = input_size - (output_size - 1) * stride
padding = 0
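As a quick check with input_size = 5 and output_size = 3: stride = 5 // 3 = 1 and kernel_size = 5 - (3 - 1) * 1 = 3, and the standard pooling output-size formula floor((5 + 2*0 - 3) / 1) + 1 = 3 indeed recovers the desired output size.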
These are derived by inverting the standard pooling output-size formula. While they DO produce output of the desired size, the output is not necessarily the same as that of adaptive pooling. Here is a test snippet:
import torch
import torch.nn as nn

in_length = 5
out_length = 3

x = torch.arange(0, in_length).view(1, 1, -1).float()
print(x)

stride = in_length // out_length
avg_pool = nn.AvgPool1d(
    stride=stride,
    kernel_size=(in_length - (out_length - 1) * stride),
    padding=0,
)
adaptive_pool = nn.AdaptiveAvgPool1d(out_length)
print(avg_pool.stride, avg_pool.kernel_size)

y_avg = avg_pool(x)
y_ada = adaptive_pool(x)
print(y_avg)
print(y_ada)
# total absolute difference between the two outputs
print('Error:', (y_avg - y_ada).abs().sum().item())
Output:
tensor([[[0., 1., 2., 3., 4.]]])
(1,) (3,)
tensor([[[1., 2., 3.]]])
tensor([[[0.5000, 2.0000, 3.5000]]])
Error: 1.0
Average pooling pools from elements (0, 1, 2), (1, 2, 3) and (2, 3, 4).
Adaptive pooling pools from elements (0, 1), (1, 2, 3) and (3, 4). (Change the code a bit to see that it is not pooling from (2) only)
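One way to change the code and see this, as a quick sketch: push one-hot inputs through the adaptive layer; an output element is nonzero exactly when its kernel covers the hot index.

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool1d(3)
for hot in range(5):
    x = torch.zeros(1, 1, 5)
    x[0, 0, hot] = 1.0
    print(hot, pool(x))
# the hot index 2 shows up only in the middle output element, with
# value 1/3, so the middle kernel is (1, 2, 3), not (2) alone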
The difference can be mitigated using padding with count_include_pad=True, but in general I don't think they can be exactly the same for 2D or higher for all input/output sizes. I would imagine that would require using different paddings for left/right, which is not supported in pooling layers at the moment.

As hkchengrex's answer points out, the PyTorch documentation does not explain what rule is used by adaptive pooling layers to determine the size and locations of the pooling kernels. (In fact, there is a fixme in the PyTorch code indicating the documentation needs to be improved.)
However, the calculation of the kernel sizes and locations is implemented by this cpp function, and the key logic is actually in the calls to the functions start_index and end_index, which define the location and offset of the kernels.
I believe this Python code re-implements that code and shows how kernels are calculated:
from typing import List
import math

def kernels(ind, outd) -> List:
    """Returns a List [(kernel_offset_start, kernel_length)] defining all the
    pooling kernels for a 1-D adaptive pooling layer that takes an input of
    dimension `ind` and yields an output of dimension `outd`"""

    def start_index(a, b, c):
        return math.floor((float(a) * float(c)) / b)

    def end_index(a, b, c):
        return math.ceil((float(a + 1) * float(c)) / b)

    results = []
    for ow in range(outd):
        start = start_index(ow, outd, ind)
        end = end_index(ow, outd, ind)
        sz = end - start
        results.append((start, sz))
    return results

def kernel_indexes(ind, outd) -> List:
    """Returns a List [[*ind]] containing the indexes of the pooling kernels"""
    startsLengths = kernels(ind, outd)
    return [list(range(start, start + length)) for (start, length) in startsLengths]
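Running this for the 5 -> 3 case from the snippet above reproduces the kernels listed there:

print(kernel_indexes(5, 3))
# => [[0, 1], [1, 2, 3], [3, 4]]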
Here are the key points to notice.
First, it matters a lot whether the input dimension (ind) is an integer multiple of the output dimension (outd).
Second, when this is the case, then the adaptive layer's kernels are equally-sized and non-overlapping, and are exactly what would be produced by defining kernels and a stride based on the following rule:
stride = ind // outd
kernel_size = ind - (outd-1)*stride
padding = 0
In other words, in this case it is possible to reproduce the effect of an adaptive pooling layer by using a non-adaptive pooling layer defined with suitable stride, kernel_size, and padding. (Example further below.)
Finally, when instead it is the case that the input size is not an integer multiple of the output size, then PyTorch's adaptive pooling rule produces kernels which overlap and are of variable size.
Since the non-adaptive pooling API does not allow for variably-sized kernels, in this case it seems to me there is no way to reproduce the effect of adaptive pooling by feeding suitable values into a non-adaptive pooling layer.
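The kernels helper above shows both regimes directly:

print(kernels(6, 3))  # => [(0, 2), (2, 2), (4, 2)]  equal size, no overlap
print(kernels(5, 3))  # => [(0, 2), (1, 3), (3, 2)]  variable size, kernels overlap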
Here's an example which shows both cases. This helper function lets us compare what's happening with an adaptive average pooling layer and an ordinary average pooling layer which uses a fixed stride and kernel:
import torch
import torch.nn as nn

def compare1DAdaptivity(ind, outd, inputpattern):
    c = 1
    padding = 0
    input = torch.Tensor(inputpattern).view(1, c, ind)
    stride = ind // outd
    kernel_size = (ind - (outd - 1) * stride)
    avg_pool = nn.AvgPool1d(stride=stride, kernel_size=kernel_size, padding=padding)
    avg_out = avg_pool(input)
    adap_avg_pool = torch.nn.AdaptiveAvgPool1d(outd)
    adap_avg_out = adap_avg_pool(input)
    try:
        equal_output = torch.allclose(avg_out, adap_avg_out)
    except:
        # allclose raises if the two outputs have different shapes
        equal_output = False
    print("input.shape: {}".format(input.shape))
    print("in_dims: {}".format(ind))
    print("out_dims: {}".format(outd))
    print("")
    print("AAL strides: {}".format(stride))
    print("AAL kernel_sizes: {}".format(kernel_size))
    print("AAL pad: {}".format(padding))
    print("")
    print("outputs equal: {}".format(equal_output))
    print("")
    print("AAL input -> output: {} -> {}".format(input, avg_out))
    print("adap input -> output: {} -> {}".format(input, adap_avg_out))
    return equal_output
So, to give an example of the first case, where the input dimension is a multiple of the output dimension, we can go from 6 to 3. We can see that the approximate adaptive layer and the true adaptive layer give the same output:
compare1DAdaptivity(6, 3, [1, 0, 0, 0, 0, 0]) # => True
AAL input -> output: tensor([[[1., 0., 0., 0., 0., 0.]]]) -> tensor([[[0.5000, 0.0000, 0.0000]]])
adap input -> output: tensor([[[1., 0., 0., 0., 0., 0.]]]) -> tensor([[[0.5000, 0.0000, 0.0000]]])
However, this no longer works if we go from 5 to 3.
compare1DAdaptivity(5,3,[1,0,0,0,0]) # => False
AAL input -> output: tensor([[[1., 0., 0., 0., 0.]]]) -> tensor([[[0.3333, 0.0000, 0.0000]]])
adap input -> output: tensor([[[1., 0., 0., 0., 0.]]]) -> tensor([[[0.5000, 0.0000, 0.0000]]])
But we can reproduce the result of the adaptive layer by manually averaging over the kernel indexes:
t = [1,0,0,0,0]; [sum( [t[x] for x in xs] ) / len(xs) for xs in kernel_indexes(5,3)]
# => [0.5,0.0,0.0]
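Going one step further, I believe the 2-D adaptive layer applies the same start/end rule independently to each spatial dimension. Here is a sketch building on the kernel_indexes helper above (this is my reading of the C++ code, not documented behavior):

import torch
import torch.nn as nn

def adaptive_avg_2d(x, oh, ow):
    # average over the per-dimension kernels produced by kernel_indexes
    rows = kernel_indexes(x.shape[-2], oh)
    cols = kernel_indexes(x.shape[-1], ow)
    out = torch.zeros(*x.shape[:-2], oh, ow)
    for i, rs in enumerate(rows):
        for j, cs in enumerate(cols):
            out[..., i, j] = x[..., rs, :][..., cs].mean(dim=(-2, -1))
    return out

x = torch.rand(1, 1, 5, 7)
print(torch.allclose(adaptive_avg_2d(x, 3, 4), nn.AdaptiveAvgPool2d((3, 4))(x)))  # True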