I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.<
Cosine similarity of two n-dimensional vectors A and B is defined as:
which simply is the cosine of the angle between A and B.
while the Euclidean distance is defined as
Now think about the distance of two random elements of the vector space. For the cosine distance, the maximum distance is 1 as the range of cos is [-1, 1].
However, for the euclidean distance this can be any non-negative value.
When the dimension n gets bigger, two randomly chosen points have a cosine distance which gets closer and closer to 90°, whereas points in the unit-cube of R^n have an euclidean distance of roughly 0.41 (n)^0.5 (source)
cosine distance is better for vectors in a high-dimensional space because of the curse of dimensionality. (I'm not absolutely sure about it, though)