How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python (gensim)?

Submitted by 天大地大妈咪最大 on 2021-02-06 09:26:09

Question


LDA Original Output

  • Unigrams

    • topic1 - scuba, water, vapor, diving

    • topic2 - dioxide, plants, green, carbon

Required Output

  • Bigram topics

    • topic1 - scuba diving, water vapor

    • topic2 - green plants, carbon dioxide

Any ideas?


Answer 1:


Given a dict called docs that maps document IDs to lists of words, you can extend each list with bigrams (or trigrams, etc.) using nltk.util.ngrams or your own function like this:

from nltk.util import ngrams

# Append underscore-joined bigrams to each document's token list
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

Then pass the values of this dict to the LDA model as the corpus. Bigrams joined by underscores are thus treated as single tokens.
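As a minimal self-contained sketch of this expansion step (the ngrams helper below is a stand-in for nltk.util.ngrams, and the two toy documents are invented for illustration):

```python
def ngrams(tokens, n):
    """Yield consecutive n-grams from a token list (stand-in for nltk.util.ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

# Toy documents, echoing the topics from the question
docs = {
    "d1": ["scuba", "diving", "needs", "water"],
    "d2": ["green", "plants", "absorb", "carbon", "dioxide"],
}

# Append underscore-joined bigrams so LDA treats them as single tokens
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]

print(docs["d1"])
# ['scuba', 'diving', 'needs', 'water', 'scuba_diving', 'diving_needs', 'needs_water']
```

The resulting token lists can then be converted to a gensim dictionary and bag-of-words corpus in the usual way before training the LDA model.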




Answer 2:


You can use word2vec to find the terms most similar to the top-n topic terms extracted by LDA.

Starting from the LDA output, create a dictionary of bigrams from the extracted topic terms (e.g. san_francisco).

See http://www.markhneedham.com/blog/2015/02/12/pythongensim-creating-bigrams-over-how-i-met-your-mother-transcripts/

Then run word2vec to get the most similar words (unigrams, bigrams, etc.):

Word and Cosine distance

los_angeles (0.666175)
golden_gate (0.571522)
oakland (0.557521)

See https://code.google.com/p/word2vec/ (the "From words to phrases and beyond" section).
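The similarity scores above come from comparing word vectors. As a minimal sketch of how such a ranking is computed (the 3-dimensional toy vectors are invented for illustration; real word2vec embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (illustrative only, not real word2vec output)
vectors = {
    "san_francisco": [0.9, 0.1, 0.3],
    "los_angeles":   [0.8, 0.2, 0.4],
    "banana":        [0.1, 0.9, 0.0],
}

# Rank all other words by similarity to the query term
query = vectors["san_francisco"]
ranked = sorted(
    ((w, cosine_similarity(query, v)) for w, v in vectors.items() if w != "san_francisco"),
    key=lambda x: -x[1],
)
print(ranked[0][0])  # los_angeles ranks above banana
```

In gensim this ranking is what `Word2Vec` exposes via its most_similar query; the sketch only shows the underlying arithmetic.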



Source: https://stackoverflow.com/questions/32476336/how-to-abstract-bigram-topics-instead-of-unigrams-using-latent-dirichlet-allocat
