I need to fine-tune my word2vec model. I have two datasets, data1 and data2.
What I did so far is:
model = gensim.models.Word2Vec(data1)
Is this correct?
Yes, it is. But you need to make sure that data2's words are in the vocabulary built from data1; any words that aren't present in the vocabulary will be lost (they are silently ignored during training).
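One way to avoid losing those words is to add data2's vocabulary to the model before calling train on it; gensim's build_vocab supports this with update=True. A minimal sketch, assuming gensim 4.x, that model was already built on data1, and that both corpora are lists of tokenized sentences:

# Words from data2 that the model doesn't know are silently skipped during training.
missing = {w for sentence in data2 for w in sentence if w not in model.wv.key_to_index}
print(missing)

# Extend the existing vocabulary with data2's words before fine-tuning.
model.build_vocab(data2, update=True)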
Note that the weights that will be computed by
model.train(data1, total_examples=len(data1), epochs=epochs)
and
model.train(data2, total_examples=len(data2), epochs=epochs)
won't be equal to the weights computed by
model.train(data1+data2, total_examples=len(data1+data2), epochs=epochs)
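To make the difference concrete, here is a sketch of the two alternatives (gensim 4.x; the corpora are toy stand-ins). Fine-tuning runs data2 over weights already shaped by data1, while the combined run sees sentences from both corpora in every epoch, so the two schedules generally end up with different weights:

import gensim

# Toy stand-ins; in practice these are lists of tokenized sentences.
data1 = [["the", "cat", "sat"], ["the", "dog", "barked"]]
data2 = [["the", "cat", "meowed"], ["the", "bird", "sang"]]
epochs = 5

# Option 1: fine-tune. data2 nudges weights already learned from data1.
model_a = gensim.models.Word2Vec(data1, vector_size=50, min_count=1)
model_a.build_vocab(data2, update=True)
model_a.train(data2, total_examples=len(data2), epochs=epochs)

# Option 2: one combined run over the concatenated corpus.
model_b = gensim.models.Word2Vec(data1 + data2, vector_size=50, min_count=1, epochs=epochs)

# Same word, different vectors under the two schedules.
print(model_a.wv["cat"])
print(model_b.wv["cat"])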
Do I need to store the learned weights somewhere?
No, you don't need to.
But if you want, you can save the trained model to a file so you can reuse it later.
model.save("word2vec.model")
And you load it back with
model = gensim.models.Word2Vec.load("word2vec.model")
(source)
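If you only need the vectors for lookup and not further training, a lighter alternative (again assuming gensim 4.x) is to save just the KeyedVectors from the model above:

from gensim.models import KeyedVectors

# Save only the word vectors, without the full training state.
model.wv.save("word2vec.wordvectors")

# Reload them later; vectors loaded this way can be queried but not trained further.
wv = KeyedVectors.load("word2vec.wordvectors")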
I need to fine-tune my word2vec model.
Note that "Word2vec training is an unsupervised task, there’s no good way to objectively evaluate the result. Evaluation depends on your end application." (source) But there's some evaluations that you can look-up here ("How to measure quality of the word vectors" section)
Hope that helps!