Position-wise Feed-Forward Networks (Transformer)

Posted by 假装没事ソ on 2020-08-19 18:56:09

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.

This consists of two linear transformations with a ReLU activation in between.
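In the notation of the original paper, this can be written as

FFN(x) = max(0, x·W1 + b1)·W2 + b2

where W1 maps from d_model to d_ff and W2 maps back from d_ff to d_model.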

While the linear transformations are the same across different positions, they use different parameters from layer to layer.

Another way of describing this is as two convolutions with kernel size 1.
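As a rough check of that equivalence (a sketch added here, not part of the original code, assuming PyTorch's nn.Conv1d), the weights of the two linear layers can be copied into two kernel-size-1 convolutions and the outputs compared:

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
x = torch.randn(4, 10, d_model)  # (batch, seq_len, d_model)

# position-wise pair of linear layers with a ReLU in between
linear = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

# the same computation expressed as two convolutions with kernel size 1
conv = nn.Sequential(nn.Conv1d(d_model, d_ff, kernel_size=1), nn.ReLU(),
                     nn.Conv1d(d_ff, d_model, kernel_size=1))

# copy the linear weights into the conv kernels so both compute the same function
with torch.no_grad():
    conv[0].weight.copy_(linear[0].weight.unsqueeze(-1))  # (d_ff, d_model) -> (d_ff, d_model, 1)
    conv[0].bias.copy_(linear[0].bias)
    conv[2].weight.copy_(linear[2].weight.unsqueeze(-1))
    conv[2].bias.copy_(linear[2].bias)

# Conv1d expects (batch, channels, seq_len), so transpose in and out
out_linear = linear(x)
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_linear, out_conv, atol=1e-5))  # True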

The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Position-wise feed-forward sub-layer: two linear transformations with a
    ReLU in between, plus the surrounding dropout, residual connection and
    layer normalization."""

    def __init__(self, d_in, d_hid, dropout=0.1):
        super(FeedForward, self).__init__()
        self.w1 = nn.Linear(d_in, d_hid)  # d_model -> d_ff
        self.w2 = nn.Linear(d_hid, d_in)  # d_ff -> d_model
        self.layer_norm = nn.LayerNorm(d_in, eps=1e-6)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x.shape: (batch, seq_len, d_in)
        residual = x

        x = self.w2(F.relu(self.w1(x)))  # FFN(x) = max(0, x W1 + b1) W2 + b2
        x = self.dropout(x)
        x = x + residual                 # residual connection

        x = self.layer_norm(x)           # post-norm, as in the original Transformer
        return x

batch_size = 64
seq_len = 20
d_model = 512
dff = 2048
data = torch.randn(batch_size, seq_len, d_model)  # a dummy batch: (batch, seq_len, d_model)

model = FeedForward(d_model, dff)

print(data.shape)         # torch.Size([64, 20, 512])
print(model(data).shape)  # same shape: the sub-layer preserves (batch, seq_len, d_model)
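
As a small added check (not in the original post): because the same parameters are applied at every position independently, feeding a single position through the module matches the corresponding slice of the full output once dropout is disabled.

model.eval()  # disable dropout so the comparison is deterministic
with torch.no_grad():
    full = model(data)              # (64, 20, 512)
    single = model(data[:, :1, :])  # a single position: (64, 1, 512)
print(torch.allclose(full[:, :1, :], single, atol=1e-5))  # True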

