From Transformer, ELMo, and GPT to BERT

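RNN: hard to parallelize, because each step depends on the previous one.

CNN: each filter sees only local context, so many layers must be stacked to cover a long range.

Self-attention: can take the whole sequence into account and can be computed in parallel (Attention Is All You Need).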

Diagram: x¹, x², x³, x⁴ are first embedded as a¹, a², a³, a⁴, which are then fed into the self-attention layer to produce b¹, b², b³, b⁴. Note: the outputs can all be computed in parallel.

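Now let's look at how b¹ is computed.

First, the matrices W_q, W_k, W_v project each aⁱ into (qⁱ, kⁱ, vⁱ). Note: all three matrices multiply the same aⁱ, which is why this is called self-attention.

Computing b¹: take the dot product of q¹ with every kⁱ, divide by √d (d is the dimension of q and k), apply softmax over i to get the attention weights, and sum the vⁱ with those weights.

Diagram of the whole process (the scaling step is omitted)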
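The computation of b¹ through b⁴ can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the sizes (4 tokens, dimension 8), the random weights, and the function name `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A, W_q, W_k, W_v):
    """A holds a^1..a^n as rows, shape (n, d_model).

    Returns B (rows b^1..b^n) and the (n, n) attention weights.
    """
    Q = A @ W_q                      # q^i = a^i W_q
    K = A @ W_k                      # k^i = a^i W_k
    V = A @ W_v                      # v^i = a^i W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot products q^i . k^j
    weights = softmax(scores, axis=-1)
    B = weights @ V                  # b^i = sum_j weights[i, j] * v^j
    return B, weights

# Illustrative sizes and random weights (assumptions, not from the notes).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
A = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
B, weights = self_attention(A, W_q, W_k, W_v)
```

Every row of `B` comes out of the same two matrix products, which is what makes the layer parallel across positions.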

A few additional operations:

1. Multi-head self-attention: MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

2. Positional encoding (hand-designed in the original paper, not learned)
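Both operations can be sketched in a few lines of NumPy. The multi-head function follows the formula in item 1. The positional encoding uses the sinusoidal form from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names, sizes, and random weights are assumptions for illustration, and d_model is assumed to be even.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists with one projection matrix per head."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

def sinusoidal_positional_encoding(n_positions, d_model):
    # Assumes d_model is even.
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Illustrative sizes (assumptions): 4 tokens, d_model = 8, 2 heads of size 4.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 4, 8, 2, 4
pe = sinusoidal_positional_encoding(n, d_model)
X = rng.normal(size=(n, d_model)) + pe   # embeddings plus position information
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)
```

Self-attention on its own is order-agnostic, so the positional encoding added to the embeddings is the only place where token order enters.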

Transformer

Transformer architecture diagram

Decoder part

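Default parameters (base model in the original paper): N = 6 layers, d_model = 512, d_ff = 2048, h = 8 heads, d_k = d_v = 64, dropout 0.1.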

See the Google blog for a GIF that illustrates roughly how the Transformer works.

ELMo

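Log-likelihood (the objective jointly maximizes the forward and backward language-model log-likelihoods):

Θ_x: parameters of the token representation; Θ_s: parameters of the softmax layer. Both are shared between the two directions.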
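For reference, the bidirectional language-model objective from the ELMo paper can be written as below. The LSTM parameter symbols (one set per direction) do not appear in the text above and are written here as in the paper; N is the number of tokens in the sequence.

```latex
\sum_{k=1}^{N} \Big(
  \log p\big(t_k \mid t_1, \dots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
  + \log p\big(t_k \mid t_{k+1}, \dots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
\Big)
```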

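Suppose we use an L-layer bidirectional LSTM. Then for each token t_k we get 2L+1 vector representations: the token embedding, plus one forward and one backward hidden state per layer.

ELMo concatenates the forward and backward states of each layer, which together with the token layer gives L+1 layer representations, and takes a weighted sum of them. The weights are learned separately for each downstream task.

s^task: softmax-normalized weights; γ^task: a scalar parameter that allows the task model to scale the entire ELMo vector.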
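The task-specific combination, ELMo_k = γ^task · Σ_j s_j^task · h_{k,j} for j = 0..L, can be sketched as follows. This is an illustrative sketch only: the function name `elmo_combine`, the array layout, and the random inputs are assumptions, and in a real model `s_logits` and `gamma` would be trained together with the downstream task.

```python
import numpy as np

def elmo_combine(layer_reps, s_logits, gamma):
    """Weighted sum of biLM layers.

    layer_reps: (L + 1, n_tokens, dim) array, layer 0 is the token layer.
    s_logits:   (L + 1,) unnormalized layer weights.
    gamma:      scalar that rescales the whole ELMo vector.
    """
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                          # softmax-normalized weights s^task
    # Contract the layer axis: result has shape (n_tokens, dim).
    return gamma * np.tensordot(s, layer_reps, axes=1)

# Illustrative input (assumption): L = 2 layers, 5 tokens, dimension 4.
rng = np.random.default_rng(0)
layer_reps = rng.normal(size=(3, 5, 4))
elmo = elmo_combine(layer_reps, s_logits=np.zeros(3), gamma=1.0)
```

With all logits equal and γ = 1, the result is the plain average of the layers. Training moves the weights toward whichever layers help the task.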

BERT

BERT uses a bidirectional Transformer encoder, whereas GPT uses a unidirectional (left-to-right) Transformer decoder.
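At the level of a single attention layer, the difference shows up in the attention mask. The sketch below is a simplification under that assumption, with hypothetical helper names: a full mask lets every position attend to every other position (BERT-style), while a lower-triangular mask hides future positions (GPT-style).

```python
import numpy as np

def attention_mask(n, causal):
    """mask[i, j] is True where position i may attend to position j."""
    full = np.ones((n, n), dtype=bool)
    return np.tril(full) if causal else full

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Uniform scores make the effect of the mask easy to see.
scores = np.zeros((4, 4))
bert_w = masked_softmax(scores, attention_mask(4, causal=False))
gpt_w = masked_softmax(scores, attention_mask(4, causal=True))
```

With uniform scores, every row of `bert_w` spreads its weight over all four positions, while row i of `gpt_w` spreads it only over positions 0 through i.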

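BERT is pre-trained with a masked language model (MLM) objective and next sentence prediction (NSP), and it fine-tunes well on downstream tasks: adding a single output layer is usually enough to get very good results.

Fine-tuning BERT for different tasks

Some of the details are covered below.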

Masked LM

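For each input sequence, 15% of the tokens are selected for prediction. Of the selected tokens:

80% are replaced with [MASK]: my dog is hairy → my dog is [MASK]
10% are replaced with a random token: my dog is hairy → my dog is apple
10% are left unchanged: my dog is hairy → my dog is hairy

The model then predicts only the selected tokens; the other positions do not contribute to the loss.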
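The masking procedure above can be sketched as follows. This is a simplified illustration, not the reference implementation: each token is selected independently with probability 0.15, the random replacement is drawn uniformly from a toy vocabulary, and the function name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels).

    labels[i] is the original token at selected positions and None
    elsewhere; only positions with a label contribute to the loss.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select special tokens; select the rest with prob. select_prob.
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: random token (uniform: an assumption)
        # remaining 10%: keep the original token
    return corrupted, labels

# Toy input (assumption): the example sentence repeated to get enough tokens.
rng = random.Random(0)
vocab = ["my", "dog", "is", "hairy", "apple"]
tokens = ["[CLS]"] + "my dog is hairy".split() * 50 + ["[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab, rng)
```

Because some selected tokens are kept or swapped for a random token, the model cannot rely on seeing [MASK] to know which positions it will be asked about.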

Next Sentence Prediction

Example:

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
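Training pairs like the two above can be built roughly as follows. In the BERT paper, the second segment is the true next sentence half of the time and a random sentence from the corpus the other half. The helper below is a hypothetical sketch under simplifying assumptions: it draws the random sentence from the same small list, and it skips the [MASK] replacement shown in the examples.

```python
import random

def make_nsp_example(sentences, i, rng):
    """Build one NSP example starting from sentences[i].

    sentences: list of token lists in document order (needs at least 3
    entries, with i < len(sentences) - 1).
    """
    first = sentences[i]
    if rng.random() < 0.5:
        second, label = sentences[i + 1], "IsNext"
    else:
        # Any sentence other than the current one and its true successor.
        candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        second, label = sentences[rng.choice(candidates)], "NotNext"
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment ids: 0 for [CLS], the first sentence and its [SEP]; 1 for the rest.
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, label

# Toy corpus (assumption), based on the example sentences above.
sentences = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins are flightless birds".split(),
]
tokens, segment_ids, label = make_nsp_example(sentences, 0, random.Random(0))
```

The label is predicted from the final hidden state of the [CLS] token.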

More recently, several newer variants of BERT have also appeared.

Source: https://www.cnblogs.com/skykill/p/11974547.html
