LDA

How to compute the log-likelihood of the LDA model in Vowpal Wabbit

Submitted by 末鹿安然 on 2019-12-22 10:19:55
Question: I am a typical, regular, everyday R user. In R there is a very helpful function, lda.collapsed.gibbs.sampler, in the lda package that uses a collapsed Gibbs sampler to fit a latent Dirichlet allocation (LDA) model and returns point estimates of the latent parameters using the state at the last iteration of Gibbs sampling. This function also has a great parameter, compute.log.likelihood, which, when set to TRUE, will cause the sampler to compute the log-likelihood of the words (to within a constant factor) after …
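
For comparison only, here is a minimal Python sketch using gensim rather than Vowpal Wabbit or R's lda package (the toy corpus is assumed): gensim exposes a variational lower bound on the log-likelihood, which plays a role similar to compute.log.likelihood.

```python
# A gensim sketch of evaluating a trained LDA model's log-likelihood:
# log_perplexity() reports the per-word variational lower bound, and bound()
# the corpus-level bound. The toy corpus is assumed, not from the question.

from gensim import corpora
from gensim.models import LdaModel

texts = [["user", "interface", "system"], ["graph", "trees", "minors"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

print(lda.log_perplexity(corpus))  # per-word likelihood bound (negative)
print(lda.bound(corpus))           # variational lower bound over the corpus
```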

The accuracy of LDA predict for new documents with Spark

Submitted by 无人久伴 on 2019-12-22 01:09:51
Question: I'm working with Spark's MLlib, currently doing something with LDA. But when I use the code provided by Spark (see below) to predict a document that was used to train the model, the predicted document-topic distribution is at opposite poles from the trained document-topic distribution. I don't know what caused this result. Asking for help; here is my code: train: $lda.run(corpus), where the corpus is an RDD like this: $RDD[(Long, Vector)]; the Vector contains the vocabulary, word indices, and word counts. predict: …
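
The code is truncated, but for reference, here is a minimal sketch using the DataFrame-based pyspark.ml API (the question uses the older RDD-based pyspark.mllib API; the toy data is assumed), where transform() returns each document's topic distribution and makes comparing training documents with predictions direct.

```python
# A sketch with the DataFrame-based pyspark.ml API: transform() attaches a
# 'topicDistribution' column, so a training document's distribution can be
# compared directly with a prediction. The toy (id, count-vector) rows are
# assumed data, not the asker's corpus.

from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 2.0, 0.0])), (1, Vectors.dense([0.0, 1.0, 3.0]))],
    ["id", "features"],
)

model = LDA(k=2, seed=1).fit(df)

# Per-document topic mixtures, for training documents or new ones alike.
model.transform(df).select("id", "topicDistribution").show(truncate=False)
```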

R LDA Topic Modeling: Result topics contains very similar words

Submitted by ♀尐吖头ヾ on 2019-12-21 20:49:44
Question: All: I am a beginner in R topic modeling; it all started three weeks ago. My problem is this: I can successfully process my data through the corpus, document-term matrix, and LDA function stages. My input is tweets, about 460,000 of them. But I am not happy with the result; the words across all topics are very similar. packages <- c('tm','topicmodels','SnowballC','RWeka','rJava') if (length(setdiff(packages, rownames(installed.packages()))) > 0) { install.packages(setdiff(packages, rownames(installed …
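
The code is cut off, but when every topic shows the same words, the usual cause is a handful of extremely frequent terms dominating all topics. A Python analogue of the fix (gensim rather than the asker's R/tm stack; the toy tweets are assumed) is to prune overly common and overly rare tokens before training:

```python
# A gensim analogue of the usual fix: prune tokens that appear in almost every
# tweet (or almost none) before training, so they cannot dominate every topic.
# The toy tweets are assumed data.

from gensim import corpora

texts = [
    ["rt", "love", "coffee", "morning"],
    ["rt", "hate", "traffic", "morning"],
    ["rt", "coffee", "shop", "open"],
]

dictionary = corpora.Dictionary(texts)

# Drop tokens seen in fewer than 2 documents or in more than 70% of documents;
# here that removes the boilerplate token "rt", which appears in every tweet.
dictionary.filter_extremes(no_below=2, no_above=0.7)

corpus = [dictionary.doc2bow(t) for t in texts]
print(dictionary.token2id)  # only "coffee" and "morning" survive the pruning
```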

Run cvb in Mahout 0.8

Submitted by 送分小仙女□ on 2019-12-20 10:57:13
Question: The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version of topic modeling and removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately there is only documentation for lda on how to run an example and generate meaningful output. Thus, I want to: (1) preprocess some texts correctly, (2) run the cvb0_local version of cvb, and (3) inspect the results by looking at the top n words in each of the generated topics. Answer 1: So here are …

Extract document-topic matrix from Pyspark LDA Model

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-20 09:24:04
Question: I have successfully trained an LDA model in Spark via the Python API: from pyspark.mllib.clustering import LDA; model = LDA.train(corpus, k=10). This works completely fine, but I now need the document-topic matrix for the LDA model; as far as I can tell, all I can get is the word-topic matrix, using model.topicsMatrix(). Is there some way to get the document-topic matrix from the LDA model, and if not, is there an alternative method (other than implementing LDA from scratch) in Spark to run an LDA …
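
One workaround, sketched below with assumed toy data: the DataFrame-based pyspark.ml API exposes a per-document "topicDistribution" column via transform(), which can be collected into a document-topic matrix (the RDD-based pyspark.mllib API used in the question exposes only topicsMatrix() in Python).

```python
# A sketch of assembling a document-topic matrix via the DataFrame-based
# pyspark.ml API: collect the 'topicDistribution' column and stack it into a
# (num_docs x num_topics) NumPy array. The toy count vectors are assumed data.

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(0, Vectors.dense([2.0, 0.0, 1.0])), (1, Vectors.dense([0.0, 3.0, 1.0]))],
    ["id", "features"],
)

model = LDA(k=2, seed=7).fit(df)
rows = model.transform(df).select("id", "topicDistribution").collect()

doc_topic = np.array([r["topicDistribution"].toArray() for r in rows])
print(doc_topic)  # one row per document, one column per topic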

Summary of the Principles of the Naive Bayes Algorithm

Submitted by ↘锁芯ラ on 2019-12-20 02:16:01
Text Topic Models with LDA (1): LDA Basics. Text Topic Models with LDA (2): Solving LDA with the Gibbs Sampling Algorithm. Text Topic Models with LDA (3): Solving LDA with the Variational Inference EM Algorithm.

In earlier posts we covered the matrix-factorization-based topic models LSI and NMF; here we begin discussing a widely used topic model: Latent Dirichlet Allocation (LDA below). Note that machine learning has another LDA, Linear Discriminant Analysis, which is mainly used for dimensionality reduction and classification; if you want to learn about that LDA, see the earlier post "Summary of the Principles of Linear Discriminant Analysis (LDA)". This article focuses on the LDA that stands for Latent Dirichlet Allocation.

1. The LDA Bayesian model. LDA is based on a Bayesian model, and any Bayesian model involves three pieces: the "prior distribution", the "data (likelihood)", and the "posterior distribution". We covered this Bayesian framework in "Summary of the Principles of the Naive Bayes Algorithm". For the Bayesian school: prior distribution + data (likelihood) = posterior distribution. This is easy to understand, because it matches how people think. Take your perception of good and bad people: your prior is 100 good people and 100 bad people, i.e., you believe the two groups are split half and half. Now you have been helped by 2 good people (data) and cheated by 1 bad person, so you arrive at a new posterior: 102 good people and 101 bad people, and your posterior now says there are more good people than bad. This posterior then becomes your new prior; when you are next helped by 1 good person (data) and cheated by 3 bad people (data …
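
The "prior + data = posterior" arithmetic above is exactly a conjugate update: with two categories it is a Beta-Binomial update, and with many categories a Dirichlet-Multinomial one, which is where LDA's Dirichlet priors come in. A minimal Python sketch of the good/bad-people example, using SciPy (my own illustration, not code from the article):

```python
# The good/bad-people example as a conjugate Beta update: pseudo-counts from
# the prior simply add to observed counts. Numbers come from the article's story.

from scipy.stats import beta

alpha_prior, beta_prior = 100, 100   # prior: 100 good, 100 bad
good_obs, bad_obs = 2, 1             # data: helped by 2 good, cheated by 1 bad

alpha_post = alpha_prior + good_obs  # 102
beta_post = beta_prior + bad_obs     # 101

print(beta.mean(alpha_prior, beta_prior))  # 0.5: good and bad equally likely
print(beta.mean(alpha_post, beta_post))    # ~0.5025: now tilted toward "good"
```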

gensim LdaMulticore not multiprocessing?

Submitted by 混江龙づ霸主 on 2019-12-18 06:07:33
Question: When I run gensim's LdaMulticore model on a machine with 12 cores, using: lda = LdaMulticore(corpus, num_topics=64, workers=10), I get a logging message that says "using serial LDA version on this node". A few lines later, I see another logging message that says "training LDA model using 10 processes". When I run top, I see that 11 Python processes have been spawned, but 9 are sleeping, i.e., only one worker is active. The machine has 24 cores and is not overwhelmed by any means. Why isn't LdaMulticore …
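
The "using serial LDA version on this node" message refers to gensim's distributed cluster mode, not multiprocessing, so it is generally expected even from LdaMulticore. The truncated question does not show how the corpus is built, but a commonly reported cause of sleeping workers is that the single master process cannot feed documents fast enough, e.g. when the corpus is a generator that tokenizes on the fly. A minimal sketch, assuming a toy corpus, of serializing the corpus first so iteration is cheap:

```python
# A sketch of one commonly reported fix: make the corpus cheap to iterate
# (e.g. serialize it to disk as an MmCorpus) so the master process can keep
# all workers busy instead of blocking on tokenization. Toy data assumed.

import logging
from gensim import corpora
from gensim.models import LdaMulticore

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

texts = [["human", "machine", "interface"], ["survey", "user", "computer"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Serialize once, then stream from disk: iteration becomes I/O-cheap.
corpora.MmCorpus.serialize("corpus.mm", bow)
corpus = corpora.MmCorpus("corpus.mm")

lda = LdaMulticore(corpus, id2word=dictionary, num_topics=64, workers=10)
```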

Kaggle spooky NLP

Submitted by 只愿长相守 on 2019-12-17 18:54:41
https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

Introduction: In this notebook I make a very basic attempt at topic modeling on the Spooky Author dataset. Topic modeling is the process of trying to discover the abstract topics, or "themes", underlying a corpus of documents from the words they contain. I will introduce two standard topic-modeling techniques here: the first is Latent Dirichlet Allocation (LDA), and the second is Non-negative Matrix Factorization (NMF). I will also take the opportunity to cover some natural-language-processing basics, such as tokenization, stemming, and vectorization of raw text, which should also come in handy when making predictions with a learned model. The notebook is organized as follows:

- Exploratory Data Analysis (EDA) and word clouds: analyze the data by computing simple statistics (e.g., word frequencies per author) and plotting some word clouds (with image masks).
- Natural Language Processing (NLP) with NLTK (the Natural Language Toolkit): introduce basic text-processing methods such as tokenization, stop-word removal, and vectorizing text via term frequency (TF) and inverse document frequency (TF-IDF).
- Topic modeling with LDA and NMF: implement the two topic-modeling techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF); see the sketch after this overview.

According to the competition page, the authors are identified by three sets of initials, which map to the actual authors as follows (each name links to the author's Wikipedia profile): EAP: Edgar …
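
A minimal scikit-learn sketch (my own toy documents, not the notebook's Kaggle data) of the two techniques the outline names: LDA on raw counts and NMF on TF-IDF weights.

```python
# A scikit-learn sketch of the notebook's two techniques: LDA fit on raw term
# counts and NMF fit on TF-IDF weights. The three toy documents are assumed.

from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the raven perched above my chamber door",
    "the black cat watched from the shadowy door",
    "a treatise on the chemistry of strange salts",
]

# LDA is a probabilistic model over term counts.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# NMF is typically run on TF-IDF weights instead.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)

# Top three words per NMF topic.
terms = tfidf_vec.get_feature_names_out()
for k, row in enumerate(nmf.components_):
    print(f"topic {k}:", [terms[i] for i in row.argsort()[-3:][::-1]])
```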

LDA model generates different topics every time I train on the same corpus

Submitted by 試著忘記壹切 on 2019-12-17 09:34:11
Question: I am using Python gensim to train a Latent Dirichlet Allocation (LDA) model on a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics. Why do the same LDA parameters and corpus generate different topics every time, and how do I stabilize the topic generation? I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code: from gensim import corpora, models, similarities from …
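
Gensim's LDA training is stochastic (random initialization and update order), so differing topics across runs are expected; pinning the random seed makes runs repeatable. A minimal sketch with a toy corpus (the asker's pastebin corpus is not reproduced here):

```python
# A sketch of stabilizing gensim's LDA: training is randomized, so pass a fixed
# random_state (and keep all other parameters identical) for repeatable topics.
# The toy corpus stands in for the asker's pastebin data.

from gensim import corpora
from gensim.models import LdaModel

texts = [["graph", "minors", "trees"], ["human", "computer", "interface"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Same corpus + same parameters + same seed => identical topics on every run.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)
print(lda.print_topics())
```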