How to load sentences into Python gensim?

天涯浪人 2021-02-04 08:30

I am trying to use the word2vec module from gensim natural language processing library in Python.

The docs say to initialize the model:

2 Answers

遥遥无期 (OP)

2021-02-04 09:04

As alKid pointed out, make sure the text is UTF-8.

There are two additional things you might have to worry about:

The input is too large to fit in memory and you're loading it from a file.
You want to remove stop words from the sentences.

Instead of loading one big list into memory, you can stream the file one line at a time with something like:

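import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        # Needs the NLTK stop-word corpus: nltk.download('stopwords')
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # Reopen the file on every pass and decode it as UTF-8 (Python 3)
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                # Lowercase, split on whitespace, drop stop words
                yield [i for i in line.lower().split() if i not in self.stop]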

And then,

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
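
One detail worth spelling out: as far as I know, Word2Vec walks over the sentences more than once (a vocabulary scan, then the training passes), so the input has to be an object that can be iterated again from the start, not a one-shot generator. Here is a minimal sketch of that property. It leaves out nltk and gensim so it runs on its own; the tiny `STOP_WORDS` set, the `write_sample_file` helper, and the sample text are made up for illustration only.

```python
import os
import tempfile

# Hypothetical stand-in for nltk's English stop-word list (illustration only).
STOP_WORDS = {"the", "a", "is", "on"}


class FileToSent:
    """Stream one tokenized sentence per line of a UTF-8 text file."""

    def __init__(self, filename, stop_words=STOP_WORDS):
        self.filename = filename
        self.stop = set(stop_words)

    def __iter__(self):
        # The file is reopened on every call, so each pass starts at line one.
        with open(self.filename, encoding="utf-8") as handle:
            for line in handle:
                yield [tok for tok in line.lower().split() if tok not in self.stop]


def write_sample_file():
    """Create a small temporary corpus and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("The cat is on the mat\nA dog barks\n")
    return path


path = write_sample_file()
sentences = FileToSent(path)
first_pass = list(sentences)   # first full iteration over the file
second_pass = list(sentences)  # a second iteration yields the same sentences
os.remove(path)
```

If you wrote this as a plain generator function instead, the second pass would come back empty, which is why the class with `__iter__` is the safer shape to hand to the model.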
