topic-modeling

LDA model generates different topics every time I train on the same corpus

假装没事ソ Submitted on 2019-11-27 12:54:09
I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time, and how do I stabilize the topic generation? I'm using this corpus ( http://pastebin.com/WptkKVF0 ) and this list of stopwords ( http://pastebin.com/LL7dqLcj ), and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import

Predicting LDA topics for new data

蹲街弑〆低调 Submitted on 2019-11-27 09:53:51
Question: It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers. Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data.

LDA with topicmodels, how can I see which topics different documents belong to?

爱⌒轻易说出口 Submitted on 2019-11-27 07:00:41
I am using LDA from the topicmodels package. I have run it on about 30,000 documents, acquired 30 topics, and got the top 10 words for each topic; they look very good. But I would like to see which documents belong to which topic with the highest probability. How can I do that?

myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
myStopwords <-

Remove empty documents from DocumentTermMatrix in R topicmodels?

我怕爱的太早我们不能终老 Submitted on 2019-11-27 06:37:34
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list
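The common R fix is to drop zero-sum rows of the DTM, e.g. `dtm[rowSums(as.matrix(dtm)) > 0, ]`, remembering the kept indices so model output can be mapped back to the raw documents. The same bookkeeping, sketched in plain Python (stdlib only, hypothetical token lists):

```python
# Sketch (stdlib-only analogue): after stopword removal some documents may
# end up empty; filter them out *and* keep their original indices so that
# model output can still be mapped back to the raw texts.
docs = [["topic", "model"], [], ["empty", "after", "cleaning"], []]

kept = [(i, d) for i, d in enumerate(docs) if d]  # drop empty documents
indices = [i for i, _ in kept]                    # original positions survive
corpus = [d for _, d in kept]

assert indices == [0, 2]
assert all(corpus)  # no empty documents remain
```

Keeping the index list is the part people forget: without it, row n of the topic output no longer corresponds to document n of the original data.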

How to print the LDA topic models from gensim in Python?

我们两清 Submitted on 2019-11-27 05:17:56
Question: Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models? When printing lda.print_topics(10), the code gave the following error because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities

Topic models: cross-validation with log-likelihood or perplexity

*爱你&永不变心* Submitted on 2019-11-26 23:55:11
Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 5 years ago. I'm clustering documents using topic modeling, and I need to determine the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, …, 60 topics. I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run Latent Dirichlet Allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now, I have to

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

流过昼夜 Submitted on 2019-11-26 22:31:44
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution in a new, unseen document.

Jason Lenderman: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:

newDocuments:
