topic-modeling

How to get a probability distribution for a topic in MALLET?

徘徊边缘 submitted 2019-12-02 10:26:24
Using MALLET I can get a specific number of topics and their words. How can I make sure the topic words form a probability distribution (i.e., sum to one)? For example, if I run the command below, how can I use the outputs MALLET produces to check that the probabilities of the topic words for topic 0 add up to 1?

mallet train-topics --input text.vectors --output-topic-keys topics.txt --output-doc-topics doc_comp.txt --topic-word-weights-file weights.txt --num-top-words 50 --word-topic-counts-file counts.txt --num-topics 3 --output-state topicstate.gz --alpha 1

Source: https://stackoverflow.com/questions/33251703/how
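For context on why the raw numbers don't sum to one: the values in --topic-word-weights-file are, to my knowledge, unnormalized smoothed counts (count plus beta), so each topic's weights have to be divided by that topic's total. A minimal sketch in Python, assuming the file parses into (topic, word, weight) triples:

```python
from collections import defaultdict

def normalize_weights(rows):
    """rows: (topic, word, weight) triples parsed from a MALLET
    --topic-word-weights-file (tab-separated). Returns {topic: {word: prob}}
    where each topic's probabilities sum to one."""
    totals = defaultdict(float)
    for topic, _, w in rows:
        totals[topic] += w            # per-topic normalizing constant
    dist = defaultdict(dict)
    for topic, word, w in rows:
        dist[topic][word] = w / totals[topic]
    return dist

# Toy rows standing in for a parsed weights file.
rows = [(0, "apple", 3.0), (0, "banana", 1.0), (1, "cat", 2.0)]
dist = normalize_weights(rows)
print(sum(dist[0].values()))   # -> 1.0
```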

How to reproduce exact results with LDA function in R's topicmodels package

*爱你&永不变心* submitted 2019-12-01 09:29:35
I've been unable to create reproducible results from topicmodels' LDA function. To take an example from their documentation:

library(topicmodels)
set.seed(0)
lda1 <- LDA(AssociatedPress[1:20, ], control = list(seed = 0), k = 2)
set.seed(0)
lda2 <- LDA(AssociatedPress[1:20, ], control = list(seed = 0), k = 2)
identical(lda1, lda2)
# [1] FALSE

How can I get identical results from two separate calls to LDA? As an aside (in case the package authors are on here), I find the control = list(seed = 0) snippet unfortunate and unnecessary. Behind the scenes, there's a line of the form if (missing(seed)) seed <- as.integer(Sys
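Whatever topicmodels does internally (its fitting code is seeded through the control argument rather than R's global RNG, which is why set.seed() alone doesn't help), the general requirement for exact reproducibility is that every random draw flows through one explicitly seeded generator. A toy collapsed Gibbs sampler in Python, with made-up data and not the topicmodels implementation, illustrating that two runs with the same seed match exactly:

```python
import random

def lda_gibbs(docs, k, iters=50, seed=0):
    """Tiny collapsed Gibbs sampler; ALL randomness flows through one seeded RNG,
    so identical seeds give identical topic assignments."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wi = {w: i for i, w in enumerate(vocab)}
    V, alpha, beta = len(vocab), 0.1, 0.01
    ndk = [[0] * k for _ in docs]          # doc-topic counts
    nkw = [[0] * V for _ in range(k)]      # topic-word counts
    nk = [0] * k                           # topic totals
    z = []
    for d, doc in enumerate(docs):         # random initial assignments
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wi[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                 # resample each token's topic
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d][t] -= 1; nkw[t][wi[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][wi[w]] + beta)
                           / (nk[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][n] = t
                ndk[d][t] += 1; nkw[t][wi[w]] += 1; nk[t] += 1
    return z

docs = [["apple", "banana", "apple"], ["dog", "cat", "dog"], ["apple", "cat"]]
run1 = lda_gibbs(docs, k=2, seed=42)
run2 = lda_gibbs(docs, k=2, seed=42)
print(run1 == run2)   # -> True: same seed, identical assignments
```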

Making gsub only replace entire words?

岁酱吖の submitted 2019-11-30 08:17:55
Question: (I'm using R.) For a list of words called "goodwords.corpus", I am looping through the documents in a corpus and replacing each word on the list with the word plus a number. So, for example, if the word "good" is on the list and "goodnight" is NOT on the list, then this document:

I am having a good time goodnight

would turn into:

I am having a good 1234 time goodnight

I'm using this code (EDIT: made this reproducible):

goodwords.corpus <- c("good")
test <- "I am having a good time goodnight"
for (i in 1:length(goodwords.corpus)){
  test <- gsub(goodwords.corpus[[i]], paste(goodwords.corpus[[i]], "1234"), test)
}
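For what it's worth, matching whole words only usually comes down to anchoring the pattern with word boundaries; base R's gsub accepts the same \b syntax ("\\bgood\\b"). A quick illustration of the boundary idea using Python's re module:

```python
import re

text = "I am having a good time goodnight"
# \b matches a word boundary, so "good" inside "goodnight" is left alone.
result = re.sub(r"\bgood\b", "good 1234", text)
print(result)   # -> I am having a good 1234 time goodnight
```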

Topic Modeling: How do I use my fitted LDA model to predict new topics for a new dataset in R?

烈酒焚心 submitted 2019-11-30 05:26:23
I am using the 'lda' package in R for topic modeling. I want to predict topics (collections of related words in a document) for a new dataset using an already fitted Latent Dirichlet Allocation (LDA) model. In the process, I came across the predictive.distribution() function, but it takes document_sums as an input parameter, which is an output produced only after fitting a new model. I need help understanding how to apply an existing model to a new dataset and predict its topics. Here is the example code from the package documentation written by Jonathan Chang:

#Fit a model
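The usual way around this chicken-and-egg problem is "fold-in": hold the fitted topic-word distributions fixed and infer only the new document's topic proportions. A minimal sketch of that idea in Python, with toy data; the function and variable names here are mine, not the lda package's API:

```python
def fold_in(doc_word_counts, topic_word, alpha=0.1, iters=50):
    """Estimate topic proportions for a NEW document while keeping the fitted
    topic-word probabilities fixed (a simple EM fold-in)."""
    k = len(topic_word)
    theta = [1.0 / k] * k                      # start from uniform proportions
    for _ in range(iters):
        new = [alpha] * k                      # small smoothing prior
        for w, c in doc_word_counts.items():
            denom = sum(theta[j] * topic_word[j].get(w, 1e-12) for j in range(k))
            for j in range(k):
                # expected number of tokens of w assigned to topic j
                new[j] += c * theta[j] * topic_word[j].get(w, 1e-12) / denom
        total = sum(new)
        theta = [x / total for x in new]
    return theta

# Two toy fitted topics: one about fruit, one about animals.
topics = [{"apple": 0.5, "banana": 0.5}, {"dog": 0.5, "cat": 0.5}]
theta = fold_in({"apple": 3, "banana": 2}, topics)
print([round(t, 3) for t in theta])   # heavily weighted toward topic 0
```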

Predicting LDA topics for new data

谁说胖子不能爱 submitted 2019-11-28 16:45:19
It looks like this question may have been asked a few times before (here and here), but it has yet to be answered. I'm hoping this is due to the ambiguity of the previous questions, as indicated by the comments. I apologize if I am breaking protocol by asking a similar question again; I just assumed those questions would not be seeing any new answers. Anyway, I am new to Latent Dirichlet Allocation and am exploring its use as a means of dimension reduction for textual data. Ultimately I would like to extract a smaller set of topics from a very large bag of words and build a

How to print the LDA topics models from gensim? Python

半世苍凉 submitted 2019-11-28 04:01:17
Using gensim I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA models? When printing lda.print_topics(10) the code gave the following error, because print_topics() returned None:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

The code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine
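In current gensim versions print_topics() returns the formatted topic strings, while some older versions (as in this Python 2 era question) only wrote them to the log and returned None; show_topics() or enabling logging were the usual fixes. Under the hood, each topic is just a probability distribution over the vocabulary, so the formatting can be sketched in plain Python (the function name and toy data here are illustrative, not gensim's API):

```python
def top_words(topic_word, vocab, n=3):
    """Return the n highest-probability words per topic, formatted like
    gensim's 'prob*word + prob*word' topic strings."""
    out = []
    for dist in topic_word:
        # pair each probability with its word and keep the n largest
        pairs = sorted(zip(dist, vocab), reverse=True)[:n]
        out.append(" + ".join(f"{p:.3f}*{w}" for p, w in pairs))
    return out

# Two toy topics over a three-word vocabulary.
print(top_words([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]], ["a", "b", "c"], n=2))
# -> ['0.500*a + 0.300*b', '0.700*c + 0.200*b']
```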

Topic models: cross validation with loglikelihood or perplexity

只愿长相守 submitted 2019-11-28 03:04:49
I'm clustering documents using topic modeling, and I need to find the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics. I divided my corpus into ten batches and set one batch aside as a holdout set. I then ran Latent Dirichlet Allocation (LDA) on the nine remaining batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set. I found this code in one of CV's discussion threads, but I really don't understand several of the lines below. I have a dtm matrix built from the holdout set (20
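As for the holdout computation itself: given the fitted topic-word probabilities phi and (inferred) document-topic proportions theta for the held-out documents, perplexity is the exponential of the negative per-token log-likelihood. A self-contained sketch using my own toy data structures, not the topicmodels objects:

```python
import math

def perplexity(docs, topic_word, doc_topic):
    """Held-out perplexity: exp(- total log-likelihood / token count), where
    each token's probability is p(w|d) = sum_k theta[d][k] * phi[k][w]."""
    loglik, n_tokens = 0.0, 0
    for d, counts in enumerate(docs):          # docs: list of {word: count}
        for w, c in counts.items():
            p_w = sum(t * topic_word[k].get(w, 1e-12)
                      for k, t in enumerate(doc_topic[d]))
            loglik += c * math.log(p_w)
            n_tokens += c
    return math.exp(-loglik / n_tokens)

# One topic, uniform over two words: every token has p = 0.5,
# so the perplexity is exactly 2 (the model is as good as a fair coin).
print(perplexity([{"a": 1, "b": 1}], [{"a": 0.5, "b": 0.5}], [[1.0]]))
```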