topic-modeling

LDA model generates different topics every time I train on the same corpus

假装没事ソ Submitted on 2019-11-27 12:54:09
I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time, and how do I stabilize the topic generation? I'm using this corpus ( http://pastebin.com/WptkKVF0 ) and this list of stopwords ( http://pastebin.com/LL7dqLcj ), and here's my code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import

Predicting LDA topics for new data

蹲街弑〆低调 Submitted on 2019-11-27 09:53:51
Question: It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the previous ambiguity of the question(s) asked, as indicated by comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed that those questions would not be seeing any new answers. Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data.

LDA with topicmodels, how can I see which topics different documents belong to?

爱⌒轻易说出口 Submitted on 2019-11-27 07:00:41
I am using LDA from the topicmodels package. I have run it on about 30,000 documents, acquired 30 topics, and got the top 10 words for each topic; they look very good. But I would like to see which documents belong to which topic with the highest probability. How can I do that?

myCorpus <- Corpus(VectorSource(userbios$bio))
docs <- userbios$twitter_id
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
myStopwords <-

Remove empty documents from DocumentTermMatrix in R topicmodels?

我怕爱的太早我们不能终老 Submitted on 2019-11-27 06:37:34
I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of stopwords...
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus, control=list
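The common R fix is to drop zero-sum rows of the DTM, e.g. `dtm[rowSums(as.matrix(dtm)) > 0, ]`, remembering the kept indices so model output can be mapped back to the raw documents. The same bookkeeping, sketched in plain Python (stdlib only, hypothetical token lists):

```python
# Sketch (stdlib-only analogue): after stopword removal some documents may
# end up empty; filter them out *and* keep their original indices so that
# model output can still be mapped back to the raw texts.
docs = [["topic", "model"], [], ["empty", "after", "cleaning"], []]

kept = [(i, d) for i, d in enumerate(docs) if d]  # drop empty documents
indices = [i for i, _ in kept]                    # original positions survive
corpus = [d for _, d in kept]

assert indices == [0, 2]
assert all(corpus)  # no empty documents remain
```

Keeping the index list is the part people forget: without it, row n of the topic output no longer corresponds to document n of the original data.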

How to print the LDA topic models from gensim in Python?

我们两清 Submitted on 2019-11-27 05:17:56
Question: Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models? When printing lda.print_topics(10), the code gave the following error because print_topics() returns None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities

Topic models: cross-validation with log-likelihood or perplexity

*爱你&永不变心* Submitted on 2019-11-26 23:55:11
Question: This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 5 years ago. I'm clustering documents using topic modeling, and I need to determine the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, …, 60 topics. I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run Latent Dirichlet Allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now, I have to

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

流过昼夜 Submitted on 2019-11-26 22:31:44
I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution in a new, unseen document.

Jason Lenderman: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this:

newDocuments:
