I am trying to use the word2vec module from the gensim
natural language processing library in Python.
The docs say to initialize the model:
A list of utf-8 sentences. You can also stream the data from the disk.
Make sure it's utf-8, and split it:
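from gensim.models import word2vec

sentences = [ "the quick brown fox jumps over the lazy dogs",
"Then a cop quizzed Mick Jagger's ex-wives briefly." ]
# Each sentence is passed as a list of tokens; min_count=5 ignores words seen fewer than 5 times.
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=5, workers=4)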
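
For the "stream the data from the disk" option mentioned above, here is a minimal sketch of one way it might look. It assumes a plain-text file with one sentence per line; the class name LineSentences and the file name corpus.txt are made up for illustration and are not part of gensim. The idea is only that the object yields one list of tokens per sentence and can be iterated more than once.

```python
import io
import os
import tempfile


class LineSentences(object):
    """Hypothetical helper: a re-iterable stream of tokenized sentences, one per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated more than once.
        with io.open(self.path, encoding='utf-8') as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Write the two example sentences to a temporary file, one per line.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, 'corpus.txt')
with io.open(corpus_path, 'w', encoding='utf-8') as handle:
    handle.write(u"the quick brown fox jumps over the lazy dogs\n")
    handle.write(u"Then a cop quizzed Mick Jagger's ex-wives briefly.\n")

streamed = LineSentences(corpus_path)
first_pass = list(streamed)   # tokens are read lazily, line by line
second_pass = list(streamed)  # a second pass reopens the file
```

An object like streamed could then be handed to word2vec.Word2Vec in place of the in-memory list, since only one line is held in memory at a time. gensim also ships its own word2vec.LineSentence class for the same one-sentence-per-line layout, which is probably the better choice in practice.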