In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
This consists of two linear transformations with a ReLU activation in between.
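Written out, this is the formula given in the original Transformer paper ("Attention Is All You Need"):

    FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2

where W_1 projects each position from d_model up to d_ff and W_2 projects it back down to d_model.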
While the linear transformations are the same across different positions, they use different parameters from layer to layer.
Another way of describing this is as two convolutions with kernel size 1 (a quick check of that equivalence is sketched after the code below).
The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Position-wise feed-forward sub-layer with residual connection and layer norm."""

    def __init__(self, d_in, d_hid, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hid)  # first linear: d_model -> d_ff
        self.w2 = nn.Linear(d_hid, d_in)  # second linear: d_ff -> d_model
        self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x.shape: (batch, seq_len, d_in)
        residual = x
        x = self.w2(F.relu(self.w1(x)))  # FFN(x) = max(0, x W1 + b1) W2 + b2
        x = self.dropout(x)
        x = x + residual                 # residual connection
        x = self.layer_norm(x)           # post-norm, as in the original Transformer
        return x


batch_size = 64
seq_len = 20
d_model = 512
d_ff = 2048

data = torch.randn(batch_size, seq_len, d_model)
model = FeedForward(d_model, d_ff)

print(data.shape)         # torch.Size([64, 20, 512])
print(model(data).shape)  # torch.Size([64, 20, 512])
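As noted above, the two linear transformations can equivalently be expressed as convolutions with kernel size 1. The snippet below is a minimal sketch, not part of the original implementation: it copies the weights of model.w1 and model.w2 into two nn.Conv1d layers (conv1 and conv2 are illustrative names) and checks that both paths produce the same output on data.

conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)

with torch.no_grad():
    # nn.Conv1d weights have shape (out_channels, in_channels, kernel_size),
    # so the Linear weights only need a trailing singleton dimension.
    conv1.weight.copy_(model.w1.weight.unsqueeze(-1))
    conv1.bias.copy_(model.w1.bias)
    conv2.weight.copy_(model.w2.weight.unsqueeze(-1))
    conv2.bias.copy_(model.w2.bias)

# Conv1d expects (batch, channels, seq_len), so transpose in and out.
ffn_linear = model.w2(F.relu(model.w1(data)))
ffn_conv = conv2(F.relu(conv1(data.transpose(1, 2)))).transpose(1, 2)
print(torch.allclose(ffn_linear, ffn_conv, atol=1e-4))  # True (up to float error)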