How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

猫巷女王i 2021-02-06 14:32

LDA Original Output

  • Uni-grams

    • topic1: scuba, water, vapor, diving

    • topic2: dioxide, plants, green, carbon

2 answers
  •  既然无缘
    2021-02-06 14:58

    Given a dict called docs that maps each document to its list of words, you can extend each list with bigrams (or trigrams, and so on) using nltk.util.ngrams or your own function, like this:

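    from nltk.util import ngrams

    # docs maps a document id to its list of word tokens.
    for doc in docs:
        # Append each adjacent word pair, joined by "_", to the unigram list.
        docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]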
    
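If you would rather not depend on nltk, the "your own function" route can be sketched roughly as below. The helper name add_bigrams and the toy docs dict are made up for illustration; they are not part of nltk or gensim.

```python
def add_bigrams(tokens, sep="_"):
    """Return the tokens followed by their adjacent-pair bigrams (hypothetical helper)."""
    # zip(tokens, tokens[1:]) yields each pair of neighbouring words.
    bigrams = [sep.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


# Toy data, assumed for this example only.
docs = {
    "doc1": ["scuba", "diving", "water"],
    "doc2": ["carbon", "dioxide"],
}

# Build a new dict rather than mutating the lists in place.
docs = {name: add_bigrams(tokens) for name, tokens in docs.items()}

print(docs["doc1"])
# ['scuba', 'diving', 'water', 'scuba_diving', 'diving_water']
```

A document with fewer than two tokens simply gets no bigrams added.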

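    Then you build the corpus from the values of this dict. gensim's LDA model expects bag-of-words vectors rather than raw token lists, so first map the values through a gensim Dictionary with doc2bow. Bigrams joined by underscores are thus treated as single tokens.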
