I need to train a word2vec representation on tweets using gensim. Unlike most tutorials and code I've seen on gensim, my data is not raw, but has already been preprocessed. I ha
The normal way to initialize a Word2Vec model in gensim is [1]:

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
The question is: what is sentences? sentences is supposed to be an iterable of iterables of words/tokens. It is just like the numpy matrix you have, except that each row can have a different length.
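To make the shape concrete, here is a minimal sketch of what such an input looks like (the example sentences are made up for illustration):

```python
# A minimal sketch of the `sentences` structure: an iterable of
# token lists, where rows may have different lengths (unlike a
# rectangular numpy matrix).
sentences = [
    ["hello", "world"],
    ["word2vec", "needs", "tokenized", "sentences"],
    ["short"],
]

# Each row is one sentence; the lengths can differ freely.
lengths = [len(s) for s in sentences]
print(lengths)  # [2, 4, 1]
```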
If you look at the documentation for gensim.models.word2vec.LineSentence, it gives you a way of loading a text file as sentences directly. As a hint, according to the documentation, it takes: one sentence = one line; words already preprocessed and separated by whitespace.
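The sketch below illustrates that file format by writing a tiny file and splitting it with pure Python; the whitespace splitting mimics what LineSentence yields for such a file (the sentences themselves are made up):

```python
import os
import tempfile

# One sentence per line, tokens separated by whitespace -- the
# format gensim.models.word2vec.LineSentence expects.
text = "the cat sat\nthe dog barked\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(text)
    path = f.name

# Splitting each line on whitespace approximates what LineSentence
# would yield when iterated over.
with open(path) as f:
    sentences = [line.split() for line in f]
os.unlink(path)

print(sentences)  # [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]
```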
When it says words already preprocessed, it is referring to lower-casing, stemming, stopword filtering, and all the other text-cleansing steps. In your case you wouldn't want 5 and 6 in your list of sentences, so you do need to filter them out.
Given that you already have the numpy matrix, and assuming each row is a sentence, it is better to convert it into a list of lists and filter out all the 5s and 6s. The resulting list can be used directly as the sentences argument to initialize the model. The only catch is that when you want to query the model after training, you need to input the indices instead of the tokens.
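A sketch of that conversion, using a made-up matrix and assuming 5 and 6 are the filler values you want to drop:

```python
import numpy as np

# Hypothetical preprocessed matrix; assume 5 and 6 are the filler
# values to exclude, as in the question.
mat = np.array([
    [1, 2, 5, 5],
    [3, 4, 2, 6],
])

# Filter out 5 and 6 per row. Rows may end up with different
# lengths, which is fine for the `sentences` argument.
sentences = [[tok for tok in row if tok not in (5, 6)]
             for row in mat.tolist()]
print(sentences)  # [[1, 2], [3, 4, 2]]
```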
Now one question you have is whether the model takes integers directly. The pure-Python version doesn't check types and just passes the unique tokens around, so your unique indices will work fine there. But most of the time you would want to use the C-extended routine to train your model, which is a big deal because it can give a 70x speedup [2]. I imagine in that case the C code may check for string type, which means a string-to-index mapping is stored.
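If the C-extended path does require string tokens (an assumption on my part; the pure-Python path is type-agnostic), converting the integer indices up front is a one-liner:

```python
# Hypothetical filtered rows of integer indices.
rows = [[1, 2], [3, 4, 2]]

# Convert every index to its string form before training, so the
# tokens are plain strings regardless of which code path runs.
sentences = [[str(tok) for tok in row] for row in rows]
print(sentences)  # [['1', '2'], ['3', '4', '2']]
```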
Is this inefficient? I think not, because the strings you have are numbers, which are generally much shorter than the real tokens they represent (assuming they are compact indices starting from 0). The model will therefore be smaller, which saves some effort when serializing and deserializing it at the end. You have essentially encoded the input tokens in a shorter string format and separated that encoding from the word2vec training; the word2vec model does not, and need not, know this encoding happened before training.
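Since the model only ever sees the encoded indices, keeping a side table lets you translate query results back into real tokens. Both the vocabulary and the returned indices below are hypothetical placeholders:

```python
# Hypothetical index -> word table from your preprocessing step.
vocab = {0: "cat", 1: "dog", 2: "bird"}

# Suppose a similarity query on the trained model returned these
# string-encoded neighbour indices:
neighbour_indices = ["1", "2"]

# Decode them back to the original tokens outside the model.
neighbours = [vocab[int(i)] for i in neighbour_indices]
print(neighbours)  # ['dog', 'bird']
```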
My philosophy is to try the simplest way first. I would just throw a sample test input of integers at the model and see what goes wrong. Hope it helps.
[1] https://radimrehurek.com/gensim/models/word2vec.html
[2] http://rare-technologies.com/word2vec-in-python-part-two-optimizing/