Word2Vec: Number of Dimensions

后端未结

关注

 3  735

再見小時候 2020-12-28 17:02

I am using Word2Vec with a dataset of roughly 11,000,000 tokens looking to do both word similarity (as part of synonym extraction for a downstream task) but I don\'t have a

3条回答

囚心锁ツ (楼主)

2020-12-28 17:29

Typical interval is between 100-300. I would say you need at least 50D to achieve lowest accuracy. If you pick lesser number of dimensions, you will start to lose properties of high dimensional spaces. If training time is not a big deal for your application, i would stick with 200D dimensions as it gives nice features. Extreme accuracy can be obtained with 300D. After 300D word features won't improve dramatically, and training will be extremely slow.

I do not know theoretical explanation and strict bounds of dimension selection in high dimensional spaces (and there might not a application-independent explanation for that), but I would refer you to Pennington et. al, Figure2a where x axis shows vector dimension and y axis shows the accuracy obtained. That should provide empirical justification to above argument.

0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...