How do I print an LDA topic model and the word cloud of each of the topics?

眼角桃花 2020-12-29 15:59
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from gensim import corpora, models
import gensim
import os
from os import path
from t         


        
2 Answers
  • 2020-12-29 16:12

    The following worked for me. First, create an LDA model and define the clusters/topics as discussed in Topic Clustering, making sure minimum_probability is 0. With that setting, each document's entry in the LDA corpus should list every topic in order, so i[0] and i[1] below refer to topics 0 and 1. Next, compute the LDA corpus with lda_corpus = lda[corpus]. Then collect the documents belonging to each topic into a list; the example below has two topics. df is my raw data, which has a column named texts.

    cluster1 = [j for i,j in zip(lda_corpus,df.texts) if i[0][1] > .2]
    cluster2 = [j for i,j in zip(lda_corpus,df.texts) if i[1][1] > .2]
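
    The same selection can be sketched without gensim, to make the data shape explicit. This is only an illustration: the lda_corpus values and texts below are made up, and documents_for_topic is a hypothetical helper, not part of any library. It looks each topic up by id rather than by position, which may be a little more forgiving if some topic is missing from a document's list.

```python
# Hypothetical stand-in for lda[corpus]: one list of (topic_id, probability)
# pairs per document, as expected when minimum_probability is 0.
lda_corpus = [
    [(0, 0.85), (1, 0.15)],
    [(0, 0.10), (1, 0.90)],
    [(0, 0.45), (1, 0.55)],
]
# Hypothetical stand-in for df.texts.
texts = ["first document", "second document", "third document"]


def documents_for_topic(doc_topics, documents, topic_id, threshold=0.2):
    """Return the documents whose probability for topic_id exceeds threshold."""
    selected = []
    for topic_probs, document in zip(doc_topics, documents):
        # Look the topic up by id instead of by list position, so a topic
        # that is absent from a document simply counts as probability 0.
        probability = dict(topic_probs).get(topic_id, 0.0)
        if probability > threshold:
            selected.append(document)
    return selected


cluster1 = documents_for_topic(lda_corpus, texts, topic_id=0)
cluster2 = documents_for_topic(lda_corpus, texts, topic_id=1)
```

    Note that a document can land in more than one cluster (the third one does here), because the threshold is applied to each topic independently.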
    

    Obtain the word cloud for each cluster. You can include as many stop words as you need. Make sure to clean the data in each cluster first (remove stop words, apply stemming, etc.); I am skipping those steps here and assume each cluster already holds cleaned texts/documents.

    from wordcloud import WordCloud
    wordcloud = WordCloud(relative_scaling=1.0, stopwords={"xxx", "yyy"}).generate(" ".join(cluster1))
    

    Finally, plot the word cloud using matplotlib (imported as plt):

    plt.imshow(wordcloud)
    
  • 2020-12-29 16:15

    Assuming you have trained a gensim LDA model, you can create a word cloud for each topic with the following code:

    # lda is assumed to be the variable holding the LdaModel object
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    for t in range(lda.num_topics):
        plt.figure()
        # Recent wordcloud versions expect a {word: weight} dict, hence dict()
        plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
        plt.axis("off")
        plt.title("Topic #" + str(t))
        plt.show()
    

    I will highlight a few mistakes in your code so you can better follow what I have written above.

    WordCloud().generate(something) expects something to be raw text. It will tokenize it, lowercase it and remove stop words and then compute the word cloud. You need the word sizes to match their probability in a topic (I assume).

    lda.print_topics(8, 200) returns a textual representation of the topics, as in prob1*"token1" + prob2*"token2" + ... Instead, you need lda.show_topic(topic, num_words) to get each word with its corresponding probability as tuples. Then you need WordCloud().fit_words() to generate the word cloud.
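
    To make the difference between the two return shapes concrete, here is a small sketch in plain Python. The topic_terms and topic_string values are invented for illustration, and to_frequencies and parse_topic_string are hypothetical helpers. The sketch assumes a recent wordcloud release, where fit_words takes a word-to-weight dict rather than a list of tuples.

```python
import re

# Hypothetical output of lda.show_topic(t, 3): (word, probability) pairs.
topic_terms = [("python", 0.031), ("data", 0.024), ("team", 0.012)]

# Hypothetical matching string as formatted by lda.print_topics().
topic_string = '0.031*"python" + 0.024*"data" + 0.012*"team"'


def to_frequencies(terms):
    """Turn (word, probability) pairs into a {word: probability} mapping."""
    return {word: float(probability) for word, probability in terms}


def parse_topic_string(text):
    """Recover the same mapping from the formatted string (extra work)."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', text)
    return {word: float(probability) for probability, word in pairs}


frequencies = to_frequencies(topic_terms)
# With the wordcloud package installed, this mapping could be passed on as:
#     WordCloud().fit_words(frequencies)
```

    Both helpers produce the same mapping here, but the tuple form needs no string parsing, which is why show_topic is the more convenient starting point.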

    The following code is your code with the above visualization added. I would also like to point out that you are inferring topics from a single document, which is very uncommon and probably not what you wanted.

    from nltk.tokenize import RegexpTokenizer
    from stop_words import get_stop_words
    from gensim import corpora, models
    import gensim
    import os
    from os import path
    from time import sleep
    import matplotlib.pyplot as plt
    import random
    from wordcloud import WordCloud, STOPWORDS
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = set(get_stop_words('en'))
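    # Raw string keeps the backslashes in the Windows path literal
    # ('\u' is an invalid escape in a normal Python 3 string).
    with open(r'c:\users\kaila\jobdescription.txt', errors='replace') as f:
        Reader = f.read()
    
    Reader = Reader.replace("will", " ")
    Reader = Reader.replace("please", " ")
    
    # Python 3: f.read() already returns str, so unicode() is not needed;
    # errors='replace' is passed to open() instead.
    texts = Reader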
    tdm = []
    
    raw = texts.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in en_stop]
    tdm.append(stopped_tokens)
    
    dictionary = corpora.Dictionary(tdm)
    corpus = [dictionary.doc2bow(i) for i in tdm]
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=8, id2word = dictionary)
    for t in range(ldamodel.num_topics):
        plt.figure()
        # Recent wordcloud versions expect a {word: weight} dict, hence dict()
        plt.imshow(WordCloud().fit_words(dict(ldamodel.show_topic(t, 200))))
        plt.axis("off")
        plt.title("Topic #" + str(t))
        plt.show()
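
    On the single-document point: one possible way to give the model several documents is to split the file before tokenizing. The sketch below uses only the standard library and rests on assumptions of mine: the sample text and the short stop list are made up, and treating each blank-line-separated paragraph as a document is just one choice among many.

```python
import re

# Hypothetical stand-in for the contents of the job description file.
raw_text = """We are hiring a data engineer.
You will build pipelines.

Please send your resume.
The team works remotely."""

# Small illustrative stop list; the code above uses get_stop_words('en').
stop_words = {"we", "are", "a", "you", "will", "your", "the", "please"}


def split_into_documents(text):
    """Treat each blank-line-separated paragraph as one document."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]


def tokenize(document):
    """Lowercase, split on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", document.lower())
    return [t for t in tokens if t not in stop_words]


# One token list per document, the shape corpora.Dictionary(tdm) expects.
tdm = [tokenize(doc) for doc in split_into_documents(raw_text)]
```

    Each entry of tdm would then become one bag-of-words vector in corpus, so the model sees several documents instead of one.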
    

    Although it comes from a different library, you can see topic visualizations, with the corresponding code, that show what the result will look like (Disclaimer: I am one of the authors of that library).
