topic-modeling

LDA topic model using the R text2vec package and LDAvis in shinyApp

对着背影说爱祢 submitted on 2019-12-24 08:04:48
Question: Here is the code for LDA topic modelling with the R text2vec package:

library(text2vec)
tokens = docs$text %>%   # docs$text: a collection of text documents
  word_tokenizer
it = itoken(tokens, ids = docs$id, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model = text2vec::LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc
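The excerpt breaks off before the model is fitted and visualized. For comparison, here is a hedged Python sketch of the same pipeline, with gensim standing in for text2vec and pyLDAvis standing in for LDAvis; the data and file names are illustrative, and the pyLDAvis module path varies between releases:

from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # older releases: pyLDAvis.gensim

docs = [["topic", "models", "are", "fun"],
        ["lda", "finds", "latent", "topics"],
        ["fun", "with", "latent", "variables"]]  # toy tokenized documents
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Priors mirror the question's doc_topic_prior / topic_word_prior.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      alpha=0.1, eta=0.01, random_state=1)

# Build the LDAvis-style interactive page and write it to disk;
# the resulting HTML can then be embedded in a web app.
vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda.html")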

Folding in (estimating topics for new documents) in LDA using Mallet in Java

这一生的挚爱 submitted on 2019-12-21 20:55:59
Question: I'm using Mallet through Java, and I can't work out how to evaluate new documents against an existing topic model which I have trained. My initial code to generate my model is very similar to that in the Mallet Developer's Guide for Topic Modelling, after which I simply save the model as a Java object. In a later process, I reload that Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original
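For reference, the fold-in idea itself (scoring unseen documents against already-trained topics without retraining) looks like this as a minimal Python/gensim sketch; this is an analogue with illustrative data, not Mallet's Java API:

from gensim import corpora, models

train = [["cats", "purr", "softly"],
         ["dogs", "bark", "loudly"],
         ["cats", "and", "dogs", "play"]]
dictionary = corpora.Dictionary(train)
corpus = [dictionary.doc2bow(d) for d in train]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

# Fold in an unseen document: score it against the trained topics
# without retraining. Tokens missing from the dictionary are dropped.
new_bow = dictionary.doc2bow(["cats", "meow"])
print(lda.get_document_topics(new_bow))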

R LDA Topic Modeling: Result topics contain very similar words

♀尐吖头ヾ submitted on 2019-12-21 20:49:44
Question: All: I am a beginner in R topic modeling; it all started three weeks ago. My problem is that I can successfully process my data into a corpus and a document-term matrix and run the LDA function. My input is tweets, about 460,000 of them. But I am not happy with the result: the words across all topics are very similar.

packages <- c('tm', 'topicmodels', 'SnowballC', 'RWeka', 'rJava')
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed
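Near-identical topics usually mean a handful of very frequent tokens dominate every topic; pruning them before training often helps. A hedged Python/gensim sketch of that preprocessing step (the thresholds are illustrative and would need tuning for a 460,000-tweet corpus):

from gensim import corpora, models

tweets = [["rt", "love", "this", "song"],
          ["rt", "hate", "mondays"],
          ["rt", "love", "coffee", "mornings"]]
dictionary = corpora.Dictionary(tweets)

# Drop tokens appearing in too few or too many documents; on real data
# values like no_below=10, no_above=0.2 are common starting points.
dictionary.filter_extremes(no_below=1, no_above=0.5)
corpus = [dictionary.doc2bow(t) for t in tweets]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)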

Run cvb in Mahout 0.8

送分小仙女□ submitted on 2019-12-20 10:57:13
Question: The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for topic modeling and removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately there is only documentation for lda on how to run an example and generate meaningful output. Thus, I want to:
- preprocess some texts correctly
- run the cvb0_local version of cvb
- inspect the results by looking at the top n words in each of the generated topics

Answer 1: So here are

LDA model generates different topics every time I train on the same corpus

試著忘記壹切 submitted on 2019-12-17 09:34:11
Question: I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time? And how do I stabilize the topic generation? I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:

from gensim import corpora, models, similarities
from
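LDA inference is stochastic (random initialization plus sampling), so run-to-run variation is expected. The usual fix is pinning the seed; a minimal sketch, assuming a recent gensim where LdaModel accepts random_state (older versions required seeding numpy instead):

from gensim import corpora, models

sentences = [["human", "machine", "interface"],
             ["graph", "of", "trees"],
             ["machine", "learning", "graph"]]  # stand-in for the 231-sentence corpus
dictionary = corpora.Dictionary(sentences)
corpus = [dictionary.doc2bow(s) for s in sentences]

# Identical seed + identical corpus => identical topics on every run.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      random_state=42, passes=10)
print(lda.print_topics())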

Remove empty documents from DocumentTermMatrix in R topicmodels?

廉价感情. submitted on 2019-12-17 08:24:23
Question: I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of
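The underlying problem is the same in any toolkit: aggressive preprocessing can empty a document entirely, and LDA then chokes on the all-zero row (in R the usual fix is subsetting the DTM by nonzero row sums). A hedged Python/gensim sketch of the filtering step, keeping the original ids so results can be mapped back:

from gensim import corpora

texts = [["topics", "model"],
         [],                      # emptied by stopword/number removal
         ["empty", "rows", "break", "lda"]]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]

# Drop empty documents but remember their original positions.
kept = [(i, bow) for i, bow in enumerate(bows) if bow]
ids, corpus = zip(*kept)
print(ids)  # (0, 2)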

Spark MLlib LDA, how to infer the topic distribution of a new unseen document?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-17 06:10:00
Question: I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to use the model afterwards to find the topic distribution of a new unseen document.

Answer 1: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you're going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where
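The answer targets the RDD-based API as it stood in Spark 1.5. In the newer DataFrame-based pyspark.ml API the same inference is a plain transform; a hedged sketch with illustrative data and column names:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.master("local[2]").appName("lda-infer").getOrCreate()

train = spark.createDataFrame([(0, "cats purr and purr"),
                               (1, "dogs bark loudly")], ["id", "text"])
unseen = spark.createDataFrame([(2, "cats and dogs")], ["id", "text"])

tok = Tokenizer(inputCol="text", outputCol="words")
cv = CountVectorizer(inputCol="words", outputCol="features").fit(tok.transform(train))

model = LDA(k=2, maxIter=20).fit(cv.transform(tok.transform(train)))

# transform() folds unseen rows into the trained topics; the per-document
# topic mixture appears in the "topicDistribution" column.
result = model.transform(cv.transform(tok.transform(unseen)))
result.select("id", "topicDistribution").show(truncate=False)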

Term weighting for original LDA in gensim

六眼飞鱼酱① submitted on 2019-12-13 05:02:12
Question: I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf... My question is: what term weighting should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure.

Answer 1: It should be a corpus represented as a "bag of words". Or, yes, lists of term counts. The correct format is that of the corpus defined in the first tutorial
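Concretely, "lists of term counts" means each document becomes (token_id, raw_count) pairs; a minimal sketch with a toy corpus:

from gensim import corpora

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"]]
dictionary = corpora.Dictionary(docs)

# Plain LDA expects raw integer term frequencies, not tf-idf weights.
corpus = [dictionary.doc2bow(d) for d in docs]
print(corpus[0])  # e.g. [(0, 2), (1, 1)]: "apple" twice, "banana" once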

R: How to generate a vector of the highest value in each row? [duplicate]

心已入冬 submitted on 2019-12-12 19:14:23
Question: This question already has answers here: For each row return the column name of the largest value (7 answers). Closed last year.

Let's say that my data frame contains

> DF
   V1  V2  V3
1 0.3 0.4 0.7
2 0.4 0.2 0.1
3 0.2 0.8 0.3
4 0.5 0.8 0.9
5 0.2 0.7 0.8
6 0.8 0.3 0.6
7 0.1 0.5 0.4

The rows would be the different types of automobiles, and the columns would be the probabilities for a given category of V1, V2, V3. I want to generate a vector that assigns to each automobile the category of which it
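For reference, the per-row argmax is a one-liner in Python/pandas (in R the equivalent idiom is colnames(DF)[max.col(DF)]); a sketch using the question's data:

import pandas as pd

df = pd.DataFrame({"V1": [0.3, 0.4, 0.2, 0.5, 0.2, 0.8, 0.1],
                   "V2": [0.4, 0.2, 0.8, 0.8, 0.7, 0.3, 0.5],
                   "V3": [0.7, 0.1, 0.3, 0.9, 0.8, 0.6, 0.4]})

# Column name of each row's largest value.
print(df.idxmax(axis=1).tolist())
# ['V3', 'V1', 'V2', 'V3', 'V3', 'V1', 'V2']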

Implementing a Topic Model with Python (numpy)

≡放荡痞女 submitted on 2019-12-12 08:54:11
Question: Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, taking as a reference some code from a site. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to a posterior conditional probability distribution inferred from the LDA model, and update word-topic counts, as follows:

for m, doc in enumerate(docs):    # m: doc id
    for n, t in enumerate(doc):   # n: id of word inside document, t: id of the word globally
        #
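The excerpt stops at the top of the loop. For context, here is a hedged sketch of one full collapsed-Gibbs update in the same style; the count arrays n_dk, n_kw, n_k and hyperparameters alpha, beta are illustrative names, not the question's actual code:

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta):
    """One pass of collapsed Gibbs sampling for LDA.

    z[m][n] -- current topic of word n in doc m
    n_dk    -- doc-topic counts, shape (D, K)
    n_kw    -- topic-word counts, shape (K, V)
    n_k     -- total words per topic, shape (K,)
    """
    K, V = n_kw.shape
    for m, doc in enumerate(docs):          # m: doc id
        for n, t in enumerate(doc):         # n: position in doc, t: global word id
            k = z[m][n]
            # Remove the current word from all counts.
            n_dk[m, k] -= 1; n_kw[k, t] -= 1; n_k[k] -= 1
            # Posterior conditional p(z_mn = k | everything else).
            p = (n_dk[m] + alpha) * (n_kw[:, t] + beta) / (n_k + V * beta)
            p /= p.sum()
            k = np.random.choice(K, p=p)
            # Add the word back under its newly sampled topic.
            z[m][n] = k
            n_dk[m, k] += 1; n_kw[k, t] += 1; n_k[k] += 1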