topic-modeling

Spark LDA woes - prediction and OOM questions

Submitted by 安稳与你 on 2019-12-05 21:43:28
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents
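
For orientation, here is a minimal PySpark sketch of the same EM-optimized training (the question itself uses the Java API; the toy corpus, k, iteration count, and app name below are placeholders, and the single-document prediction wrapper from SPARK-10809 is not shown):

# Sketch only: rough PySpark 1.6 equivalent of the EM-optimized LDA training
# described above. Corpus, k, and iteration counts are placeholders.
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-sketch")

# corpus: RDD of (docId, term-count vector); all vectors share one vocabulary
corpus = sc.parallelize([
    (0, Vectors.sparse(5, {0: 2.0, 3: 1.0})),
    (1, Vectors.sparse(5, {1: 1.0, 4: 3.0})),
])

# the EM optimizer yields a DistributedLDAModel, as in the question
model = LDA.train(corpus, k=2, maxIterations=20, optimizer="em")

# per topic: indices of the top terms and their weights
for topic in model.describeTopics(maxTermsPerTopic=3):
    print(topic)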

Latent Dirichlet Allocation with prior topic words

Submitted by 痞子三分冷 on 2019-12-04 17:25:52
Context: I'm trying to extract topics from a set of texts using Latent Dirichlet Allocation from Scikit-Learn's decomposition module. This works really well, except for the quality of the topic words found/selected. In an article by Li et al. (2017), the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words associated with/belonging to these topics. For these words they set the default value to a high number for the associated topic and 0 for the other topics. All other words (not manually selected for a topic) are given equal values for all topics
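
For reference, here is a minimal scikit-learn sketch of an unseeded LDA fit and topic-word inspection; note that LatentDirichletAllocation only exposes the scalar doc_topic_prior / topic_word_prior hyperparameters, so the per-word seeding from Li et al. (2017) that the question asks about is not a built-in option. The corpus below is a placeholder, and older scikit-learn releases use n_topics / get_feature_names instead of n_components / get_feature_names_out:

# Baseline sketch: unseeded scikit-learn LDA plus top-word extraction from the
# components_ (topic-word) matrix. Texts are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["the coffee this morning was excellent",
         "she had toast for breakfast",
         "the talks on the first day were great",
         "the second day should have good presentations too"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

lda_model = LatentDirichletAllocation(n_components=4, topic_word_prior=0.01,
                                      random_state=0)
lda_model.fit(X)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda_model.components_):
    top = weights.argsort()[::-1][:5]
    print(k, [terms[i] for i in top])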

Folding in (estimating topics for new documents) in LDA using Mallet in Java

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-04 16:20:21
I'm using Mallet through Java, and I can't work out how to evaluate new documents against an existing topic model which I have trained. My initial code to generate my model is very similar to that in the Mallet Developers' Guide for Topic Modelling, after which I simply save the model as a Java object. In a later process, I reload that Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original training set. This stats.SE thread provides some high-level suggestions, but I can't see how to work

How to evaluate the best K for LDA using Mallet?

Submitted by 烈酒焚心 on 2019-12-04 16:13:50
I am using the Mallet API to extract topics from Twitter data, and the topics I have already extracted seem good. But I am having trouble estimating K. For example, I fixed the K value from 10 to 100, so I have extracted different numbers of topics from the data. Now I would like to estimate which K is best. Some algorithms I know of are: perplexity, empirical likelihood, marginal likelihood (harmonic mean method), and silhouette. I found a method, model.estimate(), which may be used with different values of K, but I have no idea how to show which value of K is best for the
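
Mallet does not report a single "best K" by itself; one common approach is to train a model for each candidate K and compare a score such as topic coherence across them. The sketch below illustrates that comparison with gensim's CoherenceModel rather than the Mallet Java API, purely as an illustration; tokenized_tweets and top_words_for_k are hypothetical placeholders standing in for the tokenized corpus and the top words exported from each Mallet run:

# Illustration only: rank candidate K values by c_v topic coherence.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized_tweets = [["coffee", "morning", "excellent"],      # placeholder corpus
                    ["toast", "breakfast"],
                    ["talks", "first", "day"],
                    ["second", "day", "presentations"]]
top_words_for_k = {                                          # hypothetical Mallet output
    10: [["coffee", "morning"], ["talks", "day"]],
    20: [["coffee", "breakfast"], ["day", "presentations"]],
}

dictionary = Dictionary(tokenized_tweets)

scores = {}
for k, topics in top_words_for_k.items():
    cm = CoherenceModel(topics=topics, texts=tokenized_tweets,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)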

R LDA Topic Modeling: Result topics contains very similar words

Submitted by 蹲街弑〆低调 on 2019-12-04 14:05:11
All: I am a beginner in R topic modeling; it all started three weeks ago. My problem is that I can successfully process my data into a corpus, a document-term matrix, and the LDA function. My input is tweets, about 460,000 of them. But I am not happy with the result; the words across all topics are very similar. packages <- c('tm','topicmodels','SnowballC','RWeka','rJava') if (length(setdiff(packages, rownames(installed.packages()))) > 0) { install.packages(setdiff(packages, rownames(installed.packages()))) } options( java.parameters = "-Xmx4g" ) library(tm) library(topicmodels) library(SnowballC
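
One common cause of near-identical topics is that a handful of very frequent terms dominates every topic. The sketch below shows one common mitigation, dropping terms that occur in too few or too many documents before fitting, using gensim purely to illustrate the idea rather than the asker's tm/topicmodels pipeline, where the analogous steps would be stopword removal and pruning very common or very sparse terms:

# Illustration only: frequency filtering before LDA so topics can differentiate.
# The tiny corpus is a placeholder.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_tweets = [["coffee", "morning", "coffee"],
                    ["coffee", "breakfast"],
                    ["talks", "day", "talks"],
                    ["talks", "presentations"]]

dictionary = Dictionary(tokenized_tweets)
# keep terms that appear in at least 2 documents and in at most 50% of them
dictionary.filter_extremes(no_below=2, no_above=0.5)

corpus = [dictionary.doc2bow(doc) for doc in tokenized_tweets]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                     passes=5, random_state=1)
print(lda_model.print_topics())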

Implementing Topic Model with Python (numpy)

Submitted by 给你一囗甜甜゛ on 2019-12-04 08:25:05
Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, taking some code from a site as a reference. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to a posterior conditional probability distribution inferred from the LDA model, and update word-topic counts, as follows: for m, doc in enumerate(docs): #m: doc id for n, t in enumerate(doc): #n: id of word inside document, t: id of the word globally # discount counts for word t with associated topic z z = z_m_n[m][n] n_m_z[m][z] -= 1 n_z_t[z, t] -= 1 n_z[z] -
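
The excerpt cuts off mid-update; below is a self-contained sketch of one collapsed Gibbs sweep that completes the discount/resample/restore cycle using the same count-array names, with placeholder hyperparameters, corpus, and initialization:

# Sketch of one collapsed Gibbs sweep for LDA. Array names follow the excerpt;
# alpha, beta, and the toy corpus are placeholders.
import numpy as np

K, V = 2, 5                        # number of topics, vocabulary size
alpha, beta = 0.1, 0.01            # symmetric Dirichlet hyperparameters
docs = [[0, 1, 2, 0], [3, 4, 3]]   # toy corpus: documents as lists of word ids

rng = np.random.default_rng(0)
z_m_n = [list(rng.integers(K, size=len(doc))) for doc in docs]  # random topic init
n_m_z = np.zeros((len(docs), K)) + alpha    # doc-topic counts (priors folded in)
n_z_t = np.zeros((K, V)) + beta             # topic-word counts (priors folded in)
n_z = np.zeros(K) + V * beta                # per-topic totals (priors folded in)
for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        z = z_m_n[m][n]
        n_m_z[m, z] += 1; n_z_t[z, t] += 1; n_z[z] += 1

for m, doc in enumerate(docs):              # one Gibbs sweep over all words
    for n, t in enumerate(doc):
        z = z_m_n[m][n]
        # discount counts for word t and its current topic z
        n_m_z[m, z] -= 1; n_z_t[z, t] -= 1; n_z[z] -= 1
        # full conditional p(z = k | rest) and resampling of the topic
        p_z = n_m_z[m] * n_z_t[:, t] / n_z
        new_z = rng.choice(K, p=p_z / p_z.sum())
        # restore counts under the newly sampled topic
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1; n_z_t[new_z, t] += 1; n_z[new_z] += 1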

Memory error in python using numpy array

Submitted by 夙愿已清 on 2019-12-04 06:08:08
Question: I am getting the following error for this code: model = lda.LDA(n_topics=15, n_iter=50, random_state=1) model.fit(X) topic_word = model.topic_word_ print("type(topic_word): {}".format(type(topic_word))) print("shape: {}".format(topic_word.shape)) print ("\n") n = 15 doc_topic=model.doc_topic_ for i in range(15): print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax())) topic_csharp=np.zeros(shape=[1,n]) np.copyto(topic_csharp,doc_topic[0][i]) for i, topic_dist in enumerate(topic
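
The excerpt does not include the traceback, so the memory error itself cannot be diagnosed from it. Separately, note that in the lda package doc_topic_ has shape (n_documents, n_topics), so the top topic of document i is doc_topic[i].argmax() rather than doc_topic[0][i].argmax(). A minimal sketch with placeholder data:

# Sketch of the intended indexing with the `lda` package; X and titles are
# placeholders standing in for the asker's document-term matrix and labels.
import numpy as np
import lda

X = np.random.randint(0, 5, size=(20, 100))          # placeholder term counts
titles = ["doc %d" % i for i in range(X.shape[0])]

model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
model.fit(X)

topic_word = model.topic_word_    # shape (n_topics, vocabulary_size)
doc_topic = model.doc_topic_      # shape (n_documents, n_topics)

for i in range(min(15, X.shape[0])):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))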

error Installing topicmodels in R Ubuntu

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-04 05:45:12
I am getting an error while installing the topicmodels package in R. On running install.packages("topicmodels",dependencies=TRUE), the following are the last few lines I get. Please help. My R version is 3.1.3. g++ -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c utilities.cpp -o utilities.o gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c utils.c -o

Removing stopwords from a user-defined corpus in R

Submitted by 谁说我不能喝 on 2019-12-03 16:13:45
I have a set of documents: documents = c("She had toast for breakfast", "The coffee this morning was excellent", "For lunch let's all have pancakes", "Later in the day, there will be more talks", "The talks on the first day were great", "The second day should have good presentations too") In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using: documents = tolower(documents) #make it lower case documents = gsub('[[:punct:]]', '', documents) #remove punctuation First I convert to a Corpus object: documents <- Corpus