Attention measures how related two tokens are by multiplying their embedding vectors together, i.e. taking a dot product. When the dot product is large, the two embedding vectors point in a similar direction, which means the tokens are semantically similar.
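Here is a minimal sketch of that idea, assuming toy 2-D embeddings and NumPy; the token labels and the numbers are made up purely for illustration, and real attention would use learned query/key projections rather than raw embeddings:

```python
import numpy as np

# Toy embeddings, one row per token. Values are invented for illustration.
embeddings = np.array([
    [1.0, 0.9],   # "cat"
    [0.9, 1.0],   # "kitten" -- nearly the same direction as "cat"
    [-1.0, 0.2],  # "carburetor" -- an unrelated direction
])

# Pairwise dot products: a large score means two vectors point the
# same way, i.e. the tokens are similar.
scores = embeddings @ embeddings.T

# A softmax over each row turns raw scores into attention weights
# that sum to 1, so each token distributes its "attention" over the others.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(np.round(scores, 2))
print(np.round(weights, 2))
```

Running this, "cat" and "kitten" get a high mutual score (their vectors nearly coincide), while "carburetor" scores low against both, so after the softmax most of each word's attention weight lands on its close neighbor.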