lda

IndexError while using Gensim package for LDA Topic Modelling

馋奶兔 submitted on 2020-01-06 04:05:39
Question: I have a total of 54892 documents with 360331 unique tokens. The length of the dictionary is 88.

    mm = corpora.MmCorpus('PRC.mm')
    dictionary = corpora.Dictionary('PRC.dict')
    lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=50, update_every=0, chunksize=19188, passes=650)

Whenever I run this script I get this error:

    Traceback (most recent call last):
    File "C:\Users\modelDeTopics.py", line 19, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm,
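The likely culprit is the dictionary line. Below is a hedged sketch of the fix, assuming 'PRC.dict' was saved with dictionary.save(): passing a filename string straight to corpora.Dictionary treats it as an iterable of documents and builds a meaningless tiny dictionary (which would explain a dictionary length like 88 despite 360331 unique tokens), so the corpus's token ids exceed the dictionary size and LdaModel fails with an IndexError.

    import gensim
    from gensim import corpora

    mm = corpora.MmCorpus('PRC.mm')
    # Load the saved dictionary instead of constructing a new one from the
    # filename string; use Dictionary.load_from_text instead if the file was
    # written with save_as_text.
    dictionary = corpora.Dictionary.load('PRC.dict')

    lda = gensim.models.ldamodel.LdaModel(
        corpus=mm, id2word=dictionary, num_topics=50,
        update_every=0, chunksize=19188, passes=650)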

Concept Behind The Transformed Data Of LDA Model

◇◆丶佛笑我妖孽 submitted on 2020-01-05 03:36:18
Question: My question is related to Latent Dirichlet Allocation. Suppose we apply LDA to our dataset and then call fit_transform on it. The output is a matrix with one row per document; here there are five documents, each described by three topics. The output is below:

    [[ 0.0922935   0.09218227  0.81552423]
     [ 0.81396651  0.09409428  0.09193921]
     [ 0.05265482  0.05240119  0.89494398]
     [ 0.05278187  0.89455775  0.05266038]
     [ 0.85209554  0.07338382  0.07452064]]

So, this is the matrix that will be sent to a classification
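For readers who want to reproduce a matrix of this shape, here is a hedged scikit-learn sketch (the toy documents are invented for illustration): fit_transform returns one row per document and one column per topic, and each row is a probability distribution over topics that can serve as a feature vector for a classifier.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["sports match win", "economy market stocks", "game score team",
            "bank trade inflation", "player coach league"]
    X = CountVectorizer().fit_transform(docs)            # document-term counts
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    doc_topics = lda.fit_transform(X)                    # shape (5, 3)

    # Each row is the document's topic distribution and sums to ~1.0,
    # so the matrix can be passed to a classifier: clf.fit(doc_topics, labels)
    print(doc_topics.sum(axis=1))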

LDA: Why sampling for inference of a new document?

风流意气都作罢 submitted on 2020-01-04 06:03:50
Question: Given a standard LDA model with a few thousand topics, trained on a few million documents with Mallet's collapsed Gibbs sampler: when inferring a new document, why not just skip sampling and simply use the model's term-topic counts to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how topics are composed (beta, term-frequency distributions)
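To make the coupling concrete, here is a minimal sketch (all array names are hypothetical, not Mallet's API) of collapsed Gibbs inference for a single new document with the trained model held fixed. The point is that resampling token i uses the document's own current topic counts n_dk, which change as the other tokens are resampled; a one-shot lookup of the global term-topic counts would ignore exactly that dependency.

    # p(z_i = k | rest) is proportional to
    #   (n_dk[k] + alpha) * (n_kw[k, w_i] + beta) / (n_k[k] + V * beta)
    import numpy as np

    def infer_doc(words, n_kw, alpha=0.1, beta=0.01, iters=100, rng=np.random):
        K, V = n_kw.shape                            # fixed term-topic counts
        n_k = n_kw.sum(axis=1)                       # total tokens per topic
        z = rng.randint(K, size=len(words))          # random initial assignments
        n_dk = np.bincount(z, minlength=K).astype(float)
        for _ in range(iters):
            for i, w in enumerate(words):
                n_dk[z[i]] -= 1                      # remove token i's assignment
                p = (n_dk + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                z[i] = rng.choice(K, p=p / p.sum())  # resample from conditional
                n_dk[z[i]] += 1
        return (n_dk + alpha) / (n_dk.sum() + K * alpha)  # doc-topic estimate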

Graph only partially displaying in Jupyter Notebook output

感情迁移 submitted on 2020-01-03 15:36:37
Question: I am trying to get a PyLDAvis graph that looks like the two shown at this link, visible right away (Intertopic Distance Map and Top-30 Most Salient Terms): http://nbviewer.jupyter.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb My code does display it, but only partially: I can only see one cluster on the left and five or six terms on the right; the rest gets cut off (there should be many clusters and 30 words shown). This is the code I have: import
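For reference, a hedged sketch of the usual pyLDAvis notebook setup follows (lda_model, corpus, and dictionary are placeholders, and the suggested fix is an assumption): partial rendering is often a notebook layout issue rather than a data problem, and saving the visualization to a standalone HTML file sidesteps the notebook's output-area CSS entirely.

    import pyLDAvis
    import pyLDAvis.gensim_models  # named pyLDAvis.gensim in older releases

    pyLDAvis.enable_notebook()     # skipping this is a common cause of broken output
    vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
    vis                            # display inline in the notebook

    # Alternative that avoids notebook layout issues:
    pyLDAvis.save_html(vis, 'lda_vis.html')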

[LDA] Linear Discriminant Analysis

邮差的信 submitted on 2020-01-03 06:57:05
1. What is LDA

Linear Discriminant Analysis, abbreviated LDA and also known as Fisher's Linear Discriminant (FLD), is a classic algorithm in pattern recognition, introduced into the pattern-recognition and artificial-intelligence fields by Belhumeur in 1996. The basic idea is to project high-dimensional pattern samples onto an optimal discriminant vector space, so as to extract classification information and compress the dimensionality of the feature space. The projection guarantees that, in the new subspace, the samples have the maximum between-class distance and the minimum within-class distance, i.e., the classes are optimally separable in that space.

The goal of LDA: consider two classes, one green and one red. The left figure shows the original two-dimensional data for the two classes; the task is to reduce the data from two dimensions to one. Projecting directly onto the x1 axis or the x2 axis makes the classes overlap and degrades classification. The line in the right figure is the projection direction computed by LDA: after the mapping, the distance between the red class and the green class is maximized, while the scatter of the points within each class is minimized (equivalently, each class is maximally compact).

2. Some notes on LDA

First, what is the dimensionality after reduction? PCA reduction is tied directly to the data dimensionality: if the original data is n-dimensional, PCA can reduce it to any dimension from 1 up to n (keeping the components with the largest eigenvalues, of course). LDA reduction is tied directly to the number of classes and is unrelated to the data's own dimensionality: if the original data is n-dimensional and there are C classes in total, then after LDA one generally chooses between 1 and C-1 dimensions.
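As a short illustration of the C-1 constraint described above, here is a hedged scikit-learn sketch (the dataset choice is purely illustrative): with 3 classes, LDA can produce at most 2 discriminant components, regardless of how many input features the data has.

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)                  # 4 features, 3 classes
    lda = LinearDiscriminantAnalysis(n_components=2)   # max is C - 1 = 2
    X_reduced = lda.fit_transform(X, y)
    print(X_reduced.shape)                             # (150, 2)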

Understanding LDA Transformed Corpus in Gensim

*爱你&永不变心* submitted on 2020-01-02 06:53:12
Question: I tried to compare the contents of the BOW corpus with LDA[BOW corpus] (the corpus transformed by an LDA model trained on it with, say, 35 topics). I found the following output:

    DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
    LDA 1 : [(29, 0.80571428571428572)]
    DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
    LDA 2 : [(29, 0.83809523809523812)]
    DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
    LDA 3 : [(34, 0.75714285714285712)]
    DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
    LDA 4 : [(22,
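A hedged sketch of what is going on (the model and corpus names are placeholders): gensim returns the transformed document as sparse (topic_id, probability) pairs and drops topics whose probability falls below the model's minimum_probability cutoff, which is why each short document above shows a single dominant topic.

    bow = corpus[0]                  # e.g. [(1522, 1), (2028, 1), ...]
    print(lda[bow])                  # sparse output, e.g. [(29, 0.80...)]

    # Ask for the full topic distribution by lowering the cutoff:
    print(lda.get_document_topics(bow, minimum_probability=0.0))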

Latent Dirichlet Allocation (LDA) implementation

纵饮孤独 submitted on 2020-01-01 07:25:09
Question: Does someone know whether an implementation of the LDA algorithm (library or application) exists for the Win32 platform? Maybe in C/C++ or another language that can be compiled?

Answer 1: Well, honestly, I just googled LDA because I was curious what it was, and the second hit was a C implementation of LDA. It compiles fine with gcc, though some warnings show up. I don't know if it's pure ANSI C or not, but considering that gcc is available for Windows, this shouldn't be a problem. If