
Sagemaker LDA topic model - how to access the params of the trained model? Also, is there a simple way to capture coherence?

I'm new to Sagemaker and am running some tests to measure the performance of NTM and LDA on AWS compared with Mallet LDA and the native Gensim LDA model. I want to inspect the trained models on Sagemaker and look at things such as which words have the highest contribution to each topic, and also to get a measure of model coherence. For NTM on Sagemaker I was able to find the words with the highest contribution to each topic by downloading the output file, untarring it, and unzipping it to expose three files: params, symbol.json, and meta.json. However, when I try to do the same process…
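A minimal sketch of one way to inspect the trained LDA artifact, assuming the model.tar.gz layout that community answers describe (the file name "model_algo-1" and the [alpha, beta] array order are assumptions, not a documented interface):

import tarfile
import mxnet as mx
import numpy as np

# Untar the model artifact downloaded from the training job's S3 output path.
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model")

# Assumption: the artifact holds a single MXNet NDArray file containing
# the Dirichlet prior (alpha) and the topic-word matrix (beta).
arrays = mx.ndarray.load("model/model_algo-1")
alpha = arrays[0].asnumpy()  # shape: (num_topics,)
beta = arrays[1].asnumpy()   # shape: (num_topics, vocab_size)

# Top-10 word ids per topic; map them through your vocabulary to get words.
top_word_ids = np.argsort(beta, axis=1)[:, ::-1][:, :10]
print(top_word_ids)

For coherence, one workable route is gensim's CoherenceModel, which can score topics trained elsewhere when given the top-word lists (topics=), the tokenized texts, and a gensim dictionary.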

How to plot the results of an LDA

Question: There are quite a few answers to this question, not only on Stack Overflow but across the internet; however, none could solve my problem. I have two problems. Let me simulate some data for you:

df <- structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                     var1 = c(2, 3, 1, 2, 3, 2, 3, 3, 5, 6, 7, 6, 8, 5, 5),
                     var2 = c(9, 9, 9, 8, 7, 8, 9, 3, 2, 2, 1, 1, 2, 3, 3),
                     var3 = c(6, 7, 6, 6, 5, 6, 7, 1, 2, 1, 2, 3, 1, 1, 2)),
                .Names = c("Group", "var1", "var2", "var3"), row.names = c(NA, -15L
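The question is about R's MASS::lda, but since the examples on this page are mostly Python, here is a sketch of the same idea with scikit-learn's LinearDiscriminantAnalysis on the simulated data above. With two groups there is only one discriminant axis, so per-group histograms of the LD1 scores are the natural plot:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Same toy data as the R data frame: columns are var1, var2, var3.
X = np.array([[2, 9, 6], [3, 9, 7], [1, 9, 6], [2, 8, 6], [3, 7, 5],
              [2, 8, 6], [3, 9, 7], [3, 3, 1], [5, 2, 2], [6, 2, 1],
              [7, 1, 2], [6, 1, 3], [8, 2, 1], [5, 3, 1], [5, 3, 2]])
y = np.array([1] * 7 + [2] * 8)

lda = LinearDiscriminantAnalysis(n_components=1)
scores = lda.fit_transform(X, y).ravel()

# Overlay a histogram of discriminant scores for each group.
for g in (1, 2):
    plt.hist(scores[y == g], alpha=0.6, label="Group {}".format(g))
plt.xlabel("LD1")
plt.legend()
plt.show()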

How to get topic probability table from text2vec LDA

Question: The LDA topic modeling in the text2vec package is amazing; it is indeed much faster than topicmodels. However, I don't know how to get the probability that each document belongs to each topic, as in the example below:

   V1          V2          V3          V4
1  0.001025237 7.89E-05    7.89E-05    7.89E-05
2  0.002906977 0.002906977 0.014534884 0.002906977
3  0.003164557 0.003164557 0.003164557 0.003164557
4  7.21E-05    7.21E-05    0.000360334 7.21E-05
5  0.000804433 8.94E-05    8.94E-05    8.94E-05
6  5.63E-05    5.63E-05    5.63E-05    5.63E-05
7  0.001984127
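If I read text2vec's R6 API correctly, fit_transform on its LDA model already returns this document-topic probability matrix. For comparison, here is a small gensim sketch (Python, the language used for new examples on this page; the toy texts are made up) that prints the same kind of table:

from gensim import corpora
from gensim.models import LdaModel

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["eps", "user", "interface", "system"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
# minimum_probability=0.0 keeps the near-zero entries, matching the table above.
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow, minimum_probability=0.0))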

Remove standard english language stop words in Stanford Topic Modeling Toolbox

Question: I am using Stanford Topic Modeling Toolbox 0.4.0 for LDA. I noticed that if I want to remove standard English stop words, I can use a StopWordFilter("en") as the last step of the tokenizer, but how do I use it?

import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

val source

LDA + Visualization

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import csv
import jieba
import codecs
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
import pyLDAvis.gensim
from gensim.models import LdaModel

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False

info = []

def data_g(filename):
    csv_reader = csv.reader

LDA and topic model

Question: I have studied LDA and topic models for several weeks, but due to my poor mathematics background I cannot fully understand their inner algorithms. I used the GibbsLDA implementation, fed in a lot of documents, and set the topic number to 100. I got a file named "final.theta" which stores the proportion of each topic in each document. This result is good, and I can use the topic proportions to do many other things. But when I tried Blei's C-language implementation of LDA, I only got a file named…
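For reference, GibbsLDA++'s final.theta is plain text with one row per document and one whitespace-separated column per topic, so it loads directly into a matrix. A short sketch (file name as in the question):

import numpy as np

# Each row is a document's topic distribution and sums to ~1.
theta = np.loadtxt("final.theta")  # shape: (num_documents, num_topics)
print("documents x topics:", theta.shape)
print("dominant topic of document 0:", theta[0].argmax())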

[Translation] Using LDA for text processing in Python

Note: this post covers the main content of the original article at http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html.

About LDA: see the book "LDA漫游指南". The Python library lda used here comes from https://github.com/ariddell/lda; the gensim library also includes LDA-related functions.

Installation:

$ pip install lda --user

Example:

from __future__ import division, print_function
import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])

Output:

type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)
[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1
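The original article continues by actually fitting the model; here is a sketch of that next step with the same lda package (n_topics, n_iter, and the number of top words shown are illustrative choices):

import numpy as np
import lda
import lda.datasets

X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)

# topic_word_ has shape (n_topics, vocab_size); show the 8 strongest words
# for the first few topics.
topic_word = model.topic_word_
for i, dist in enumerate(topic_word[:3]):
    top = np.array(vocab)[np.argsort(dist)][:-9:-1]
    print("Topic {}: {}".format(i, " ".join(top)))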

Understanding Spark MLlib LDA input format

Question: I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run its sample implementation, which takes input from a file containing only numbers, as shown:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I followed http:/
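This looks like Spark's bundled sample_lda_data.txt: each line is one document, and the j-th number on a line is the count of vocabulary term j in that document. MLlib's LDA then wants an RDD of (document id, term-count vector) pairs, which is what the documented PySpark example builds:

from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-input-demo")

# One dense term-count vector per line of the sample file.
data = sc.textFile("sample_lda_data.txt")
parsed = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split()]))

# Attach a document id to each vector: RDD of [id, vector] pairs.
corpus = parsed.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

model = LDA.train(corpus, k=3)
print(model.topicsMatrix())  # vocab_size x k matrix of topic weights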

Why isn't Stanford Topic Modeling Toolbox producing lda-output directory?

Question: I tried to run this code from github (following the 1-2-3 steps), which identifies 30 topics in Sarah Palin's 14,500 emails. The topics discovered by the author are here. However, the Stanford Topic Modeling Toolbox is not producing an lda-output directory for me. It produced lda-86a58136-30-2b1a90a6, but the summary.txt in that folder only shows the initial assignment of topics, not the final one. Any idea how to produce the lda-output directory with the final summary of discovered topics? Thanks in advance!