CBOW vs. skip-gram: why invert context and target words?

伪装坚强ぢ 2021-01-29 20:32

On this page, it is said that:

[...] skip-gram inverts contexts and targets, and tries to predict each context word from its target word [...]

2 answers
  •  醉话见心
    2021-01-29 21:30

    Here is my oversimplified and rather naive understanding of the difference:

    As we know, CBOW learns to predict a word from its context, that is, to maximize the probability of the target word given the context. This turns out to be a problem for rare words. For example, given the context "yesterday was a really [...] day", a CBOW model will tell you that the missing word is most probably "beautiful" or "nice". A word like "delightful" gets much less of the model's attention, because the model is designed to predict the most probable word; the rare word's signal is smoothed over many examples that contain more frequent words.

    On the other hand, the skip-gram model is designed to predict the context. Given the word "delightful", it must understand it and tell us that, with high probability, the context is "yesterday was a really [...] day" or some other relevant context. With skip-gram, "delightful" does not have to compete with "beautiful"; instead, each delightful+context pair is treated as a new observation.
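
    For reference, here is how the two objectives are usually written. The notation is my own assumption and is not taken from the quoted page: w_t is the word at position t, c is the window size, and T is the number of words in the corpus. This is only a sketch of the plain log-likelihood form, ignoring tricks such as negative sampling or hierarchical softmax.

```latex
% CBOW: one prediction per position, conditioned on the whole window at once
J_{\text{CBOW}} = \frac{1}{T} \sum_{t=1}^{T}
    \log p\left(w_t \mid w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}\right)

% Skip-gram: one prediction per (target word, context word) pair
J_{\text{SG}} = \frac{1}{T} \sum_{t=1}^{T}
    \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    \log p\left(w_{t+j} \mid w_t\right)
```

    In the skip-gram sum, each occurrence of a rare word such as "delightful" contributes up to 2c separate terms in which it is the conditioning word, which is the sense in which its word+context pairs count as separate observations. In the CBOW sum, the same occurrence contributes a single term in which it has to win against every other candidate for that context.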

    UPDATE

    Thanks to @0xF for sharing this article

    According to Mikolov:

    Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

    CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

    One more addition to the subject is found here:

    In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.)
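
    The example in the quote can be reproduced with a few lines of plain Python. This is only a sketch: the function names (cbow_example, skipgram_examples, reduced_window) are made up for illustration and are not part of any library, and the uniform draw from 1..window is my assumption about what "random window-reduction" means. The pair orientation (context word as input, center word as the thing to predict) follows the quote as written.

```python
import random


def cbow_example(tokens, center, window):
    """Return one CBOW example: (context words to be averaged, word to predict)."""
    lo = max(0, center - window)
    context = tokens[lo:center] + tokens[center + 1:center + 1 + window]
    return context, tokens[center]


def skipgram_examples(tokens, center, window):
    """Return pairwise examples, one per context word, oriented as in the quote:
    (single context word used as input, center word to predict)."""
    context, target = cbow_example(tokens, center, window)
    return [(word, target) for word in context]


def reduced_window(max_window, rng):
    """Assumed form of the random window reduction: the effective window
    is drawn uniformly from 1..max_window for each center word."""
    return rng.randint(1, max_window)


tokens = ["The", "cat", "ate", "the", "mouse"]
center = tokens.index("ate")

# One CBOW example for the full window of 2:
print(cbow_example(tokens, center, window=2))
# (['The', 'cat', 'the', 'mouse'], 'ate')

# Four skip-gram examples for the same window:
print(skipgram_examples(tokens, center, window=2))
# [('The', 'ate'), ('cat', 'ate'), ('the', 'ate'), ('mouse', 'ate')]

# If the window is reduced to 1, only the two nearest words remain:
print(skipgram_examples(tokens, center, window=1))
# [('cat', 'ate'), ('the', 'ate')]

# Drawing the effective window at random, as during training:
rng = random.Random(0)
effective = reduced_window(2, rng)
print(effective, skipgram_examples(tokens, center, window=effective))
```

    With a maximum window of 2, the effective window is 1 about half the time, which is where the "half the time that would just be two examples" remark comes from.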
