lda

How to evaluate the best K for LDA using Mallet?

Submitted by 烈酒焚心 on 2019-12-04 16:13:50
I am using the Mallet API to extract topics from Twitter data, and the topics I have extracted so far look reasonable. But I am having trouble estimating K. For example, I varied K from 10 to 100, so I now have models with different numbers of topics, and I would like to determine which K is best. The approaches I know of are perplexity, empirical likelihood, marginal likelihood (harmonic mean method), and silhouette. I found a method, model.estimate(), which can be run with different values of K, but I still have no idea how to show which value of K is best for the data.
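One common way to compare models trained at different K is topic coherence. The sketch below is only an illustration (not the asker's model.estimate() approach); it assumes gensim < 4.0 with its Mallet wrapper, and that mallet_path, dictionary, corpus and the tokenized texts are already prepared:

```python
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0

# Assumes mallet_path, dictionary, corpus and tokenized `texts` already exist.
scores = {}
for k in range(10, 101, 10):
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=k, id2word=dictionary)
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()  # higher is (usually) better

best_k = max(scores, key=scores.get)
print(scores)
print("K with highest coherence:", best_k)
```

Coherence is only a heuristic; it is still worth inspecting the top words of the best-scoring models by hand.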

R LDA Topic Modeling: Result topics contain very similar words

Submitted by 蹲街弑〆低调 on 2019-12-04 14:05:11
All: I am a beginner in R topic modeling; it all started three weeks ago. My problem is that I can successfully process my data into a corpus, a document-term matrix, and the LDA function. My input is tweets, about 460,000 of them. But I am not happy with the result: the words across all topics are very similar.

    packages <- c('tm','topicmodels','SnowballC','RWeka','rJava')
    if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
      install.packages(setdiff(packages, rownames(installed.packages())))
    }
    options(java.parameters = "-Xmx4g")
    library(tm)
    library(topicmodels)
    library(SnowballC)
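When every topic shows roughly the same top words, the usual culprit is a handful of extremely frequent terms dominating the whole corpus. The sketch below illustrates the idea in Python/gensim rather than the R packages above; the variable `texts` and the frequency thresholds are hypothetical examples, not the asker's setup:

```python
from gensim import corpora, models

# texts: list of tokenized tweets, e.g. [["lda", "topics", ...], ...]
dictionary = corpora.Dictionary(texts)
# Drop terms appearing in fewer than 20 tweets or in more than 30% of all tweets,
# so corpus-wide filler words cannot dominate every topic.
dictionary.filter_extremes(no_below=20, no_above=0.3)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)
for topic in lda.print_topics(num_topics=10, num_words=10):
    print(topic)
```

In R, the analogous step is to filter very frequent terms out of the document-term matrix before calling LDA().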

Error when implementing gensim.LdaMallet

Submitted by 老子叫甜甜 on 2019-12-04 09:38:55
I was following the instructions at this link (http://radimrehurek.com/2014/03/tutorial-on-mallet-in-python/); however, I ran into an error when I tried to train the model:

    model = models.LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
    IOError: [Errno 2] No such file or directory: 'c:\\users\\brlu\\appdata\\local\\temp\\c6a13a_state.mallet.gz'

Please share any thoughts you might have. Thanks.

This can happen for two reasons: 1. You have a space in your Mallet path. 2. There is no MALLET_HOME environment variable. Make sure that Mallet works properly from the command line.
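Putting the two suggestions from the answer together, a minimal sketch might look like this (the install location is hypothetical; the point is a space-free path plus an explicit MALLET_HOME, with corpus prepared as in the linked tutorial):

```python
import os
from gensim.models.wrappers import LdaMallet  # gensim < 4.0

# Hypothetical install location: no spaces anywhere in the path.
os.environ['MALLET_HOME'] = r'C:\mallet-2.0.8'
mallet_path = r'C:\mallet-2.0.8\bin\mallet'

# `corpus` is assumed to be prepared as in the linked tutorial.
model = LdaMallet(mallet_path, corpus, num_topics=10, id2word=corpus.dictionary)
```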

How to get a complete topic distribution for a document using gensim LDA?

Submitted by 匆匆过客 on 2019-12-04 09:38:14
Question: When I train my LDA model like this:

    dictionary = corpora.Dictionary(data)
    corpus = [dictionary.doc2bow(doc) for doc in data]
    num_cores = multiprocessing.cpu_count()
    num_topics = 50
    lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                       workers=num_cores, alpha=1e-5, eta=5e-1)

I want to get a full topic distribution over all num_topics for each and every document. That is, in this particular case, I want each document to report all 50 topics contributing to its distribution, and I want to be able to access all of their probabilities.
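A minimal sketch of one way to get the full distribution, assuming the lda and corpus objects from the snippet above: get_document_topics accepts a minimum_probability argument, and setting it to 0 keeps the low-probability topics instead of dropping them.

```python
# Assumes `lda`, `corpus` and num_topics = 50 from the snippet above.
for bow in corpus:
    # minimum_probability=0.0 keeps low-probability topics as well (gensim still
    # clips at a tiny internal epsilon), so the list should contain ~all 50 topics.
    full_dist = lda.get_document_topics(bow, minimum_probability=0.0)
    print(len(full_dist), full_dist[:3])
    break  # just show the first document
```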

Implementing Topic Model with Python (numpy)

Submitted by 给你一囗甜甜゛ on 2019-12-04 08:25:05
Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, taking some code from a website as a reference. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to the posterior conditional probability distribution inferred from the LDA model, and update the word-topic counts, as follows:

    for m, doc in enumerate(docs):      # m: doc id
        for n, t in enumerate(doc):     # n: position of word inside document, t: global word id
            # discount counts for word t with its currently assigned topic z
            z = z_m_n[m][n]
            n_m_z[m][z] -= 1
            n_z_t[z, t] -= 1
            n_z[z] -= 1
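For context, the step that usually follows the discounting above is to draw the new topic from the collapsed conditional and then restore the counts. A minimal sketch of that continuation, assuming the count tables are numpy arrays and that hyperparameters alpha and beta and vocabulary size V are defined elsewhere:

```python
import numpy as np  # assumed already imported in the original script

# Continuation of the inner loop above.
# Collapsed Gibbs conditional:
#   p(z=k | ...) ∝ (n_m_z[m][k] + alpha) * (n_z_t[k, t] + beta) / (n_z[k] + V * beta)
p_z = (n_m_z[m] + alpha) * (n_z_t[:, t] + beta) / (n_z + V * beta)
p_z /= p_z.sum()                                   # normalize

new_z = np.random.multinomial(1, p_z).argmax()     # sample the new topic

# restore counts with the newly sampled topic
z_m_n[m][n] = new_z
n_m_z[m][new_z] += 1
n_z_t[new_z, t] += 1
n_z[new_z] += 1
```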

Memory error in python using numpy array

Submitted by 夙愿已清 on 2019-12-04 06:08:08
Question: I am getting the following error for this code:

    model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
    model.fit(X)
    topic_word = model.topic_word_
    print("type(topic_word): {}".format(type(topic_word)))
    print("shape: {}".format(topic_word.shape))
    print("\n")
    n = 15
    doc_topic = model.doc_topic_
    for i in range(15):
        print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax()))
        topic_csharp = np.zeros(shape=[1, n])
        np.copyto(topic_csharp, doc_topic[0][i])
    for i, topic_dist in enumerate(topic
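The excerpt cuts off before the actual traceback, but with the lda package a MemoryError at this stage usually means the document-term matrix X was materialized as a dense numpy array. A hedged sketch of the common workaround, keeping X sparse (lda.LDA.fit accepts scipy sparse matrices):

```python
from scipy.sparse import csr_matrix
import lda

# Keep the document-term matrix sparse instead of a dense (n_docs, vocab_size) array.
X_sparse = csr_matrix(X)   # better still: build it sparse from the start
model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X_sparse)
```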

Latent Dirichlet allocation (LDA) in Spark - replicate model

Submitted by 本秂侑毒 on 2019-12-03 16:56:27
I want to save an LDA model from the pyspark ml-clustering package and, after saving, apply the model to the training and test data sets. However, the results diverge despite setting a seed. My code is the following:

1) Import packages

    from pyspark.ml.clustering import LocalLDAModel, DistributedLDAModel
    from pyspark.ml.feature import CountVectorizer, IDF

2) Preparing the dataset

    countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete",
                                   outputCol="raw_features", vocabSize=5000, minDF=10.0)
    cv_model = countVectors.fit(tokenized_stopwords_sample_df)
    result_tf = cv_model.transform(tokenized_stopwords_sample_df)
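For reference, a minimal sketch of how the save / reload / apply flow might continue after the preparation above; the IDF step, column names, k, and the save path are assumptions rather than the asker's exact code, and this only illustrates the mechanics of persisting the fitted model.

```python
from pyspark.ml.clustering import LDA, LocalLDAModel
from pyspark.ml.feature import IDF

# Hypothetical continuation: TF-IDF features, then LDA with a fixed seed.
idf = IDF(inputCol="raw_features", outputCol="features")
result_tfidf = idf.fit(result_tf).transform(result_tf)

lda = LDA(k=20, seed=1, optimizer="online", featuresCol="features")
lda_model = lda.fit(result_tfidf)
lda_model.save("/tmp/lda_model")                     # hypothetical path

reloaded = LocalLDAModel.load("/tmp/lda_model")      # online optimizer -> LocalLDAModel
train_topics = reloaded.transform(result_tfidf)      # same fitted model applied again
```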

How to determine the number of topics for LDA?

Submitted by 给你一囗甜甜゛ on 2019-12-03 11:50:56
Question: I am a newcomer to LDA and I want to use it in my work. However, some problems have come up. In order to get the best performance, I want to estimate the best number of topics. After reading "Finding scientific topics", I understand that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T). My question is: what does "a series of" mean?

Answer 1: Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge,
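For what it's worth, "a series of" refers to the values P(w|z^(s)) evaluated at a sequence of Gibbs samples z^(1), ..., z^(S) taken after burn-in (for example, every few iterations); their harmonic mean approximates P(w|T). A small, numerically stable sketch of that computation, assuming log_p_w_z holds the per-sample log-likelihoods:

```python
import numpy as np
from scipy.special import logsumexp

def log_harmonic_mean(log_p_w_z):
    """Log of the harmonic mean of P(w|z) values, given their logarithms."""
    log_p = np.asarray(log_p_w_z, dtype=float)
    # HM = S / sum_s 1/P(w|z_s)  =>  log HM = log S - logsumexp(-log P(w|z_s))
    return np.log(len(log_p)) - logsumexp(-log_p)

# Collect log P(w|z) at, say, every 10th Gibbs iteration after burn-in for each
# candidate number of topics T, then compare log_harmonic_mean(...) across T.
```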

LDA and topic model

Submitted by Anonymous (unverified) on 2019-12-03 08:54:24
Question: I have been studying LDA and topic models for several weeks, but because of my poor mathematics background I cannot fully understand the inner algorithms. I used the GibbsLDA implementation, fed it a lot of documents, and set the number of topics to 100; I got a file named "final.theta" that stores the proportion of each topic in each document. This result is good, and I can use the topic proportions to do many other things. But when I tried Blei's C-language implementation of LDA, I only got a file named final.gamma, and I don't know how to turn it into topic proportions.
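As a hedged sketch: in Blei's lda-c, final.gamma holds the variational Dirichlet parameters, one row per document and one column per topic, so row-normalizing it gives per-document topic proportions comparable to GibbsLDA's final.theta:

```python
import numpy as np

gamma = np.loadtxt("final.gamma")                    # shape: (num_docs, num_topics)
theta = gamma / gamma.sum(axis=1, keepdims=True)     # row-normalize to proportions
print(theta[0])                                      # topic proportions of document 0
```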

Can we use a self made corpus for training for LDA using gensim?

Submitted by 醉酒当歌 on 2019-12-03 07:40:28
Question: I have to apply LDA (Latent Dirichlet Allocation) to extract the possible topics from a database of 20,000 documents that I collected. How can I use these documents, rather than another readily available corpus such as the Brown Corpus or English Wikipedia, as the training corpus? You can refer to this page.

Answer 1: After going through the documentation of the gensim package, I found that there are four ways of transforming a text repository into a corpus. The four corpus formats are: Market
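Independent of the serialization format, the core step is simply to build the dictionary and bag-of-words corpus from your own documents. A minimal sketch, where documents is a hypothetical list holding your 20,000 raw texts:

```python
from gensim import corpora, models

# documents: your own 20,000 raw text strings (hypothetical variable name)
texts = [doc.lower().split() for doc in documents]        # naive tokenization

dictionary = corpora.Dictionary(texts)                    # id <-> word mapping
corpus = [dictionary.doc2bow(text) for text in texts]     # bag-of-words corpus

lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)
for topic in lda.print_topics(num_topics=20, num_words=8):
    print(topic)
```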