Question
Given the sentence "The animal didn't cross the street because it was too tired", how is self-attention able to assign a higher score to the word "animal" than to the word "street"?
I'm wondering whether that might be a consequence of the word embedding vectors fed into the network, which somehow already encode some degree of distance between the words.
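To make the question concrete, here is a minimal sketch of scaled dot-product attention weights in plain Python. It assumes "it" is the query word. The 2-d embeddings and the `W_q_trained` matrix are made up for illustration; they are not taken from any real model. The sketch compares scores computed from the raw embeddings alone with scores computed after a query projection, which a real Transformer learns during training.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attention_weights(query_vec, key_vecs, W_q, W_k):
    # Project the query and keys, then apply softmax(q . k / sqrt(d_k)).
    q = matvec(W_q, query_vec)
    keys = [matvec(W_k, k) for k in key_vecs]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    return softmax(scores)

# Hypothetical toy embeddings (assumption: real ones have hundreds of dimensions).
embeddings = {
    "animal": [1.0, 0.0],
    "street": [0.0, 1.0],
    "it": [0.5, 0.5],
}
key_vecs = [embeddings["animal"], embeddings["street"]]

identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical values standing in for a query projection found by training.
W_q_trained = [[4.0, 0.0], [0.0, 0.0]]

# Raw embeddings only: "it" is equally similar to both words in this toy setup.
raw = attention_weights(embeddings["it"], key_vecs, identity, identity)

# With the projection: the weight on "animal" exceeds the weight on "street".
learned = attention_weights(embeddings["it"], key_vecs, W_q_trained, identity)

print("raw     [animal, street]:", raw)
print("learned [animal, street]:", learned)
```

In this toy setup the embeddings alone give "animal" and "street" the same weight, and the preference only appears once the query is projected. That suggests the projections matter at least as much as the embeddings, though a real model also relies on positional information, multiple heads, and several layers that this sketch leaves out.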
Source: https://stackoverflow.com/questions/58855564/how-is-the-self-attention-mechanism-in-transformers-able-to-learn-how-the-words