Why can the Multi-Head Self-Attention Mechanism learn more features than a single head does?

生来不讨喜 2021-01-29 17:57

Recently I have been studying the Transformer, which contains a mechanism termed Multi-Head Self-Attention. After browsing many blogs about how it works in detail, I still do not understand why multiple heads can learn more features than a single head does.
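To make the question concrete, here is a minimal sketch of how I understand the mechanism (a simplified PyTorch-style implementation; the parameter names such as `embed_dim` and `num_heads` are my own and this is not the exact code of any library):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal sketch: the embedding is split across heads, each head runs its
    own scaled dot-product attention, and the results are concatenated."""
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Each head effectively gets its own slice of these projections,
        # so different heads can learn different attention patterns.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        b, t, d = x.shape

        def split(z: torch.Tensor) -> torch.Tensor:
            # Reshape to (batch, num_heads, seq_len, head_dim).
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = torch.softmax(scores, dim=-1)
        out = attn @ v  # (batch, num_heads, seq_len, head_dim)
        # Concatenate the heads and mix them with a final linear layer.
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)
```

Given this structure, my question is why running several smaller attention heads in parallel captures more features than one large head with the same total dimensionality.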
