How is the self-attention mechanism in Transformers able to learn how the words are related to each other?
Question: Given the sentence "The animal didn't cross the street because it was too tired", how is self-attention able to assign a higher score to the word "animal" instead of the word "street"? I'm wondering if that might be a consequence of the word-embedding vectors fed into the network, which somehow already encapsulate some degree of distance among the words.

Source: https://stackoverflow.com/questions/58855564/how-is-the-self-attention-mechanism-in-transformers-able-to-learn-how-the-words
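For context, the scores the question refers to come from scaled dot-product self-attention: the same embeddings are projected by three learned matrices into queries, keys, and values, so the resulting attention weights depend on those learned projections, not only on raw distances between the input embeddings. A minimal sketch with toy, randomly generated numbers (the shapes and values here are invented purely for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project the same token embeddings into queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product scores: each row is a distribution over all tokens,
    # saying how strongly that token attends to every other token.
    scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return scores @ V, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                        # 3 toy "tokens", embedding dim 4
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, scores = self_attention(X, W_q, W_k, W_v)
print(scores)  # each row sums to 1
```

During training, W_q and W_k are adjusted by backpropagation so that, in a sentence like the one above, the query for "it" produces a high dot product with the key for "animal"; the score is learned, not read directly off the input embeddings.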