Attention measures how related two tokens are by multiplying their embedding vectors together, i.e. taking a dot product. When the dot product is large, the two embedding vectors point in a similar direction, which means the tokens are semantically similar.
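Here is a minimal sketch of that idea, assuming toy 2-D embeddings and NumPy; the token labels and the numbers are made up purely for illustration, and real attention would use learned query/key projections rather than raw embeddings:

```python
import numpy as np

# Toy embeddings, one row per token. Values are invented for illustration.
embeddings = np.array([
    [1.0, 0.9],   # "cat"
    [0.9, 1.0],   # "kitten" -- nearly the same direction as "cat"
    [-1.0, 0.2],  # "carburetor" -- an unrelated direction
])

# Pairwise dot products: a large score means two vectors point the
# same way, i.e. the tokens are similar.
scores = embeddings @ embeddings.T

# A softmax over each row turns raw scores into attention weights
# that sum to 1, so each token distributes its "attention" over the others.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(np.round(scores, 2))
print(np.round(weights, 2))
```

Running this, "cat" and "kitten" get a high mutual score (their vectors nearly coincide), while "carburetor" scores low against both, so after the softmax most of each word's attention weight lands on its close neighbor.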