Transformer Introduction





1. Embedding

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

The word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
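
A minimal NumPy sketch of that idea, with made-up token ids and random weights (the sizes 512 and 2048 follow the paper; everything else is illustrative):

```python
import numpy as np

vocab_size, d_model, d_ff = 10000, 512, 2048    # sizes follow the paper; ids are made up
embedding_table = np.random.randn(vocab_size, d_model)

token_ids = [17, 42, 7]                         # a toy three-word input sequence
x = embedding_table[token_ids]                  # (3, 512): one embedding per position

# The feed-forward layer uses the same weights at every position and has no
# dependencies between positions, so the three rows could run in parallel.
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)
ffn_out = np.maximum(0, x @ W1) @ W2            # (3, 512)
```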

2. Encoder

An encoder receives a list of vectors as input. It processes this list by passing the vectors through a self-attention layer, then through a feed-forward neural network, and then sends the output upward to the next encoder.
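
A schematic sketch of that flow; `self_attention` and `feed_forward` are placeholder callables, not functions defined in this post:

```python
def run_encoder_stack(vectors, layers):
    """Pass the list of vectors through each encoder in the stack."""
    for self_attention, feed_forward in layers:
        vectors = self_attention(vectors)   # every position attends to every other
        vectors = feed_forward(vectors)     # applied position-by-position
    return vectors                          # handed upward to the next component
```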

  • Self-Attention

    • The first step is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). For each word, we create a Query vector, a Key vector, and a Value vector.
    • The second step in calculating self-attention is to calculate a score.

    Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

    • The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients. Other values are possible, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
    • The fifth step is to multiply each value vector by its softmax score (in preparation to sum them up). The intuition here is to keep the values of the word(s) we want to focus on intact and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
    • The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word). A short NumPy sketch of these steps follows this list.
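
A minimal NumPy sketch of the six steps for the first position, with toy dimensions (embedding size 4, q/k/v size 3) and random weight matrices; the variable names are illustrative, not from the original post:

```python
import numpy as np

np.random.seed(0)
d_model, d_k = 4, 3                       # toy sizes; the paper uses 512 and 64
x = np.random.randn(2, d_model)           # embeddings for "Thinking" and "Machines"
W_q = np.random.randn(d_model, d_k)       # trained projection matrices (random here)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

# Step 1: create query/key/value vectors for every word
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Steps 2-4: score the first word against every word, scale by sqrt(d_k), softmax
scores = q[0] @ k.T / np.sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()   # positive, sums to 1

# Steps 5-6: weight each value vector and sum them up
z0 = (weights[:, None] * v).sum(axis=0)   # output of self-attention for position 0
print(z0)
```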

3. Matrix Calculation of Self-Attention

  • The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).
Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size between the embedding vector (512, or 4 boxes in the figure) and the q/k/v vectors (64, or 3 boxes in the figure).
  • Then we can condense steps two through six into one formula to calculate the outputs of the self-attention layer:
Z = softmax(Q Kᵀ / √d_k) V
This is the self-attention calculation in matrix form.
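
A hedged sketch of the same computation in matrix form, producing one output row per input word (dimensions and weights are again made up):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Matrix form: Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # one output row per input word

X = np.random.randn(2, 4)                  # two words, embedding size 4
Z = self_attention(X, *(np.random.randn(4, 3) for _ in range(3)))
print(Z.shape)                             # (2, 3): one z vector per position
```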

An RNN maintains a hidden state that allows it to incorporate its representation of the previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

4. The Beast With Many Heads

  • “multi-headed” attention
  1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This is useful when translating a sentence like “The animal didn’t cross the street because it was too tired”, where we want to know which word “it” refers to.
  2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

  • We concatenate the eight Z matrices and then multiply them by an additional weight matrix WO (a short sketch of this computation appears at the end of this section).

  • Multi-Headed self-attention visualization

As we encode the word “it”, one attention head is focusing most on “the animal”, while another is focusing on “tired” – in a sense, the model’s representation of the word “it” bakes in some of the representation of both “animal” and “tired”.
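
A rough sketch of the wiring described above (eight independent heads, concatenation, then WO); the sizes follow the paper's defaults, but the random weights and helper functions are illustrative only:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n_heads, d_model, d_k = 8, 512, 64        # the paper's defaults
X = np.random.randn(5, d_model)           # 5 words in the sentence

# One independent, randomly initialized Q/K/V weight set per head
heads = [attention(X,
                   np.random.randn(d_model, d_k),
                   np.random.randn(d_model, d_k),
                   np.random.randn(d_model, d_k))
         for _ in range(n_heads)]

Z = np.concatenate(heads, axis=-1)        # (5, 8*64): concat the eight Z matrices
W_O = np.random.randn(n_heads * d_k, d_model)
output = Z @ W_O                          # (5, 512): fed to the feed-forward layer
print(output.shape)
```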

5. Representing The Order of The Sequence Using Positional Encoding

To give the model a sense of word order, the Transformer adds a positional-encoding vector to each input embedding. These vectors follow a pattern that helps the model determine the position of each word, or the distance between different words in the sequence.

In the following figure, each row corresponds to the positional encoding of one position. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values, each between -1 and 1. We’ve color-coded them so the pattern is visible.
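
A sketch of the sinusoidal positional encoding from the “Attention Is All You Need” paper, which produces exactly this kind of table of values between -1 and 1 (the exact column layout in the figure may differ, so treat this as illustrative):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: every value oscillates between -1 and 1."""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even indices: sine
    pe[:, 1::2] = np.cos(angles)                     # odd indices: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.min(), pe.max())     # roughly -1.0 and 1.0
# embeddings_with_position = word_embeddings + pe[:seq_len]
```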

6. Residuals

Each sub-layer (self-attention, feed-forward) in each encoder and decoder has a residual connection around it, and is followed by a layer-normalization step.

(Figure: a Transformer of 2 stacked encoders and decoders)
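
A minimal sketch of that add-and-normalize pattern; `self_attention` and `feed_forward` in the usage comments are placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Inside one encoder layer (self_attention and feed_forward are placeholders):
# x = add_and_norm(x, self_attention)
# x = add_and_norm(x, feed_forward)
```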

7. Decoder Side

(Animation: transformer decoding, step 1 – https://gitee.com/github-25970295/blogImage/raw/master/img/transformer_decoding_1.gif)

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

(Animation: transformer decoding, following steps – https://gitee.com/github-25970295/blogImage/raw/master/img/transformer_decoding_2.gif)
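
A hedged, pseudocode-style sketch of that loop; `decode_step`, `bos_id`, and `eos_id` are assumed helpers and ids, not part of the original post:

```python
def greedy_decode(encoder_output, decode_step, bos_id, eos_id, max_len=100):
    """Feed each step's output back in as input until the end symbol appears."""
    output_ids = [bos_id]
    for _ in range(max_len):
        # decode_step embeds the ids, adds positional encodings, runs the
        # decoder stack plus the final linear + softmax, and returns the
        # most likely next token id (a placeholder for the real model call).
        next_id = decode_step(encoder_output, output_ids)
        output_ids.append(next_id)
        if next_id == eos_id:          # special symbol: decoding is complete
            break
    return output_ids
```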

8. Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector. The Softmax layer then turns those scores into probabilities, and the word associated with the highest-probability cell is produced as the output for this time step.
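
A small NumPy sketch of that projection and the softmax that follows it, with a made-up vocabulary size and random weights:

```python
import numpy as np

d_model, vocab_size = 512, 10000          # vocabulary size is illustrative
decoder_output = np.random.randn(d_model) # vector of floats from the decoder stack

W = np.random.randn(d_model, vocab_size)  # the Linear layer's weights
logits = decoder_output @ W               # one score per word in the vocabulary

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax: probabilities that sum to 1
predicted_word_id = int(np.argmax(probs)) # pick the most likely word
```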

9. Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, I’d suggest these next steps:

Follow-up works:

Reposted for study from:

  • https://jalammar.github.io/illustrated-transformer/
  • Video introduction: https://www.youtube.com/watch?v=rBCqOTEfxvg