Question
If I calculate word2vec for the same word (say, "monkey"), once on the basis of one large text from the year 1800 and once on the basis of one large text from the year 2000, then the results would not be comparable, from my point of view. Am I right? And why is that so? I have the following idea: the text from the past may have a completely different vocabulary, which is the problem. But how can one then cure it (make the embeddings comparable)?
Thanks in advance.
Answer 1:
There's no "right" position for any word in a Word2Vec model – just a position that works fairly well, in relation to other words and the training data, after a bunch of the pushes-and-pulls of the incremental training. Indeed, every model starts with word-vectors in low-magnitude random positions, and the training itself includes both designed-in randomness (such as via random choice of which words to use as negative contrastive examples) and execution-order randomness (as multiple threads make progress at slightly-different rates due to the operating system's somewhat-arbitrary CPU-scheduling choices).
So, your "sentences-from-1800" and "sentences-from-2000" models will differ because the training data is different – likely from both the fact that authors' usage varied, and that each corpus is just a tiny sample of all existing usage. But also: just training on the "samples-from-1800" corpus twice in a row will result in different models! Each such model should be about-as-good as the other, in terms of the relative distances/positions of words with respect to other words in the same model. But the coordinates of individual words could be very different, and non-comparable.
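To make the "same relative positions, different coordinates" point concrete, here is a minimal sketch in plain Python. It idealizes two training runs as differing only by a rotation of the whole space, which is an assumption made for illustration (real runs differ in messier ways). The 2-D vectors and the names `run_a` and `run_b` are made up.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 2-D "word-vectors" from one training run.
run_a = {"monkey": (0.9, 0.2), "ape": (0.8, 0.4), "penicillin": (-0.3, 0.9)}

# A second "run" with identical geometry, but the whole space rotated.
angle = math.radians(100)
run_b = {word: rotate(vec, angle) for word, vec in run_a.items()}

# Similarities measured inside one model agree between the two runs...
within_a = cosine(run_a["monkey"], run_a["ape"])
within_b = cosine(run_b["monkey"], run_b["ape"])

# ...but comparing the same word across runs only recovers the rotation.
across = cosine(run_a["monkey"], run_b["monkey"])

print(within_a, within_b, across)
```

Both runs are equally good at saying that 'monkey' is close to 'ape', yet the cross-run similarity of 'monkey' with itself is just the cosine of the arbitrary rotation angle, so it carries no information about meaning.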
In order for words to be "in the same coordinate space", extra steps must be taken. The most direct way for words to be in the same space is for them to be trained together in the same model, with them appearing alternately in contrasting examples of usage, including with other common words.
So if for example you needed to compare 'calenture' (an old word for tropical fevers which might not appear in your 2000s corpus) to 'penicillin' (which was discovered in the 20th century), your best bet would be to shuffle together the two corpuses into a single corpus and train a single model. To the extent each word appeared near certain words that appeared in both eras, with relatively stable meaning, their word-vectors might then be comparable.
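Shuffling the two corpora together can be sketched as follows. This is only an illustration: the function name `merge_and_shuffle`, the toy sentences, and the assumption that each corpus is a list of tokenized sentences are all hypothetical.

```python
import random

def merge_and_shuffle(corpus_old, corpus_new, seed=0):
    """Combine two corpora (lists of token lists) and interleave them randomly."""
    combined = list(corpus_old) + list(corpus_new)
    # A seeded generator keeps the shuffle reproducible between experiments.
    random.Random(seed).shuffle(combined)
    return combined

# Toy stand-ins for the real corpora.
corpus_1800 = [["the", "sailor", "suffered", "a", "calenture"],
               ["the", "monkey", "climbed", "the", "mast"]]
corpus_2000 = [["the", "doctor", "prescribed", "penicillin"],
               ["the", "monkey", "escaped", "from", "the", "zoo"]]

combined = merge_and_shuffle(corpus_1800, corpus_2000)
print(len(combined))
```

The resulting `combined` list would then be passed as the training sentences to whichever Word2Vec implementation is in use. Shuffling matters because otherwise all the old-era examples would be seen before any new-era ones within each training pass.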
If you only need one combined word-vector for 'monkey', this approach may be fine for your purposes, as well. Yes, a word's meaning drifts over time. But even at any single point in time, words are polysemous: they have multiple meanings. And word-vectors for words with many meanings tend to move to coordinates between each of their alternate meanings. So even if 'monkey' has drifted in meaning, it is still the case that using a combined-eras corpus would probably give you a single vector for 'monkey' that reasonably represents its average meaning over all eras.
If you specifically wanted to model words' changes-in-meaning over time, then you might need other approaches:
You might want to build separate models for eras, but learn translations between them, based on the idea that some words may change-little while others change-lots. (There are ways to use certain "anchor words", assumed to have the same meaning, to learn a transformation between separate Word2Vec models, then apply that same transformation to other words to project their coordinates in another model.)
Or, make a combined model, but probabilistically replace words whose changing-meanings you'd like to track with era-specific alternate tokens. (For example, you might replace some proportion of 'monkey' occurrences with 'monkey@1800' and 'monkey@2000', as appropriate, so that in the end you get three word-vectors for 'monkey', 'monkey@1800', 'monkey@2000', allowing you to compare the different senses.)
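Both options can be sketched briefly. For the first, one common choice of transformation (an assumption here, not something the options above prescribe) is an orthogonal map fitted on the anchor words, known as orthogonal Procrustes. For the second, a small helper rewrites a share of the target tokens per era. The function names, the toy random "vectors", and the `rate` parameter are all hypothetical.

```python
import random
import numpy as np

def learn_orthogonal_map(source_anchors, target_anchors):
    """Orthogonal R minimizing ||source_anchors @ R - target_anchors|| (Frobenius)."""
    u, _, vt = np.linalg.svd(source_anchors.T @ target_anchors)
    return u @ vt

def tag_era(sentences, era, targets, rate=0.5, seed=0):
    """Replace roughly `rate` of the target-word occurrences with 'word@era'."""
    picker = random.Random(seed)
    return [
        [f"{tok}@{era}" if tok in targets and picker.random() < rate else tok
         for tok in sentence]
        for sentence in sentences
    ]

# Option 1: toy anchor vectors (5 anchor words, 3 dimensions) in an "1800" model,
# and the same anchors in a "2000" model that is a rotated copy of that space.
gen = np.random.default_rng(0)
anchors_1800 = gen.normal(size=(5, 3))
true_rotation, _ = np.linalg.qr(gen.normal(size=(3, 3)))
anchors_2000 = anchors_1800 @ true_rotation

R = learn_orthogonal_map(anchors_1800, anchors_2000)

# Project a non-anchor word from the 1800 space into the 2000 space.
monkey_1800 = gen.normal(size=3)
monkey_1800_in_2000_space = monkey_1800 @ R

# Option 2: tag every 'monkey' in an 1800 sentence (rate=1.0 for a clear demo).
tagged = tag_era([["the", "monkey", "climbed"]], "1800", {"monkey"}, rate=1.0)
print(tagged)
```

With real models the anchors will not match exactly, so the fitted map is only a best fit, and the quality of the projection depends on how stable the chosen anchor words really are. For the tagging option, a `rate` below 1 leaves some plain 'monkey' occurrences in place, which is what yields the third, era-independent vector.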
Some prior work on tracking meanings-over-time using word-vectors is the 'HistWords' project:
https://nlp.stanford.edu/projects/histwords/
Source: https://stackoverflow.com/questions/57392103/word-embeddings-for-the-same-word-from-two-different-texts