How to load sentences into Python gensim?

天涯浪人 · 2021-02-04 08:30

I am trying to use the word2vec module from gensim natural language processing library in Python.

The docs say to initialize the model like this:

    model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

But what format should `sentences` be in, and how do I load my own data into it?
2 Answers
  • 2021-02-04 09:01

    A list of utf-8 sentences. You can also stream the data from the disk.

    Make sure each sentence is unicode text, and split it into tokens:

    sentences = ["the quick brown fox jumps over the lazy dogs",
                 "Then a cop quizzed Mick Jagger's ex-wives briefly."]
    # In Python 3 strings are already unicode, so a plain split() is enough;
    # .encode('utf-8').split() was a Python 2 idiom and would yield bytes here.
    # Note: in gensim 4.x the size parameter was renamed to vector_size.
    word2vec.Word2Vec([s.split() for s in sentences],
                      size=100, window=5, min_count=5, workers=4)
    
  • 2021-02-04 09:04

    Like alKid pointed out, make it utf-8.

    There are two additional things you might have to worry about:

    1. The input is too large to load into memory all at once, so you stream it from a file.
    2. Removing stop words from the sentences.
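    Point 2 can be sketched on its own with a toy stop list (the small set here is just an illustrative assumption; the code below uses NLTK's full English stopword list):

```python
# Toy stop-word set -- an illustrative assumption, not NLTK's real list.
stop = {"the", "a", "over"}

sentence = "The quick brown fox jumps over the lazy dogs"
# Lowercase, split on whitespace, drop anything in the stop set.
tokens = [w for w in sentence.lower().split() if w not in stop]
print(tokens)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dogs']
```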

    Instead of loading a big list into the memory, you can do something like:

    import nltk
    import gensim

    class FileToSent(object):
        def __init__(self, filename):
            self.filename = filename
            self.stop = set(nltk.corpus.stopwords.words('english'))

        def __iter__(self):
            # Read one line at a time so the whole file never sits in memory.
            with open(self.filename, 'r', encoding='utf-8') as f:
                for line in f:
                    # unicode(line, 'utf-8') was Python 2; in Python 3 the
                    # file is decoded on read via the encoding argument.
                    yield [w for w in line.lower().split() if w not in self.stop]


    And then,

    sentences = FileToSent('sentence_file.txt')
    model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
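
    One reason the helper is a class with `__iter__` rather than a plain generator function: Word2Vec passes over the corpus more than once (first to build the vocabulary, then once per training epoch), and a generator is exhausted after a single pass, while an object whose `__iter__` starts fresh can be iterated repeatedly. A minimal sketch of the difference (toy sentences, no gensim needed):

```python
def gen_sentences():
    # Plain generator: usable for exactly one pass.
    for s in ["the quick brown fox", "a lazy dog"]:
        yield s.split()

class SentenceIter(object):
    # Re-iterable: every for-loop calls __iter__ again and gets a fresh pass.
    def __iter__(self):
        for s in ["the quick brown fox", "a lazy dog"]:
            yield s.split()

g = gen_sentences()
print(len(list(g)), len(list(g)))    # 2 0 -- the second pass is empty

it = SentenceIter()
print(len(list(it)), len(list(it)))  # 2 2 -- restarts each time
```

    This is the same reason `FileToSent` opens the file inside `__iter__`: each new pass reopens it from the beginning.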
    