I have been reading the papers on Word2Vec (e.g. this one), and I think I understand training the vectors to maximize the probability of other words found in the same contexts.<
Those two distance metrics are probably strongly correlated so it might not matter all that much which one you use. As you point out, cosine distance means we don't have to worry about the length of the vectors at all.
This paper indicates that there is a relationship between the frequency of the word and the length of the word2vec vector. http://arxiv.org/pdf/1508.02297v1.pdf