topic-modeling

Topic Modelling by Group using LDA in R

穿精又带淫゛_ posted on 2020-02-01 09:36:44
Question: I am stuck on one problem. I am trying to categorize sentences into topics using LDA. I have it working, but LDA runs on the whole dataset and gives me topic terms across the entire dataset. I want the topic terms for each group in the dataset. My data looks like this:

Comment | Division
Smooth execution of Regional Administration in my absence. Well done. | Finance
Job well done in completing CPs and making the facility available well in time. | Finance
Good Job
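One way to get terms per group is to split the comments by Division first and then fit a separate model on each subset. The sketch below shows only that split-then-fit structure in Python. The `top_terms` function is a hypothetical stand-in that counts words rather than fitting LDA, and the "HR" row is invented for illustration.

```python
from collections import Counter, defaultdict

# Toy rows shaped like the question's table: (comment, division).
# The "HR" row is invented for illustration.
rows = [
    ("Smooth execution of Regional Administration in my absence. Well done.", "Finance"),
    ("Job well done in completing CPs and making the facility available well in time.", "Finance"),
    ("Great support during onboarding of new staff.", "HR"),
]

def split_by_group(rows):
    """Collect the comments of each division into its own list."""
    groups = defaultdict(list)
    for comment, division in rows:
        groups[division].append(comment)
    return dict(groups)

def top_terms(comments, n=3):
    """Stand-in for a per-group model fit: the most frequent lowercase tokens.

    Hypothetical hook: replace the body with an LDA fit on `comments`.
    """
    counts = Counter(
        token.strip(".,").lower()
        for comment in comments
        for token in comment.split()
    )
    return [term for term, _ in counts.most_common(n)]

groups = split_by_group(rows)
terms_by_group = {division: top_terms(comments) for division, comments in groups.items()}
print(terms_by_group)
```

Each division gets its own vocabulary and its own topics this way, so topic numbers are not comparable across divisions.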

How to interpret the sklearn LDA perplexity score? Why does it always increase as the number of topics increases?

牧云@^-^@ posted on 2020-01-23 01:38:07
Question: I am trying to find the optimal number of topics for sklearn's LDA model. To do this I calculate perplexity, following the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2. But when I increase the number of topics, perplexity always increases, which seems unreasonable. Is my implementation wrong, or are these the correct values?

from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF,
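For reading the score itself: the scikit-learn documentation describes perplexity as the exponential of the negative log-likelihood per word, so lower values mean the model is less "surprised" by the text. The sketch below works through that formula with made-up numbers; it does not call scikit-learn, and the function name is an assumption for illustration.

```python
import math

def perplexity(total_log_likelihood, total_tokens):
    """exp of the negative average per-token log-likelihood (natural log)."""
    return math.exp(-total_log_likelihood / total_tokens)

# Sanity check: a model that spreads probability uniformly over a
# vocabulary of V words assigns log(1/V) to every token, so its
# perplexity equals V.
V = 1000
n_tokens = 5000
uniform_ll = n_tokens * math.log(1.0 / V)
uniform_ppl = perplexity(uniform_ll, n_tokens)

# A model that assigns probability 1/50 to each observed token is less
# surprised by the text and gets a lower (better) perplexity.
sharper_ppl = perplexity(n_tokens * math.log(1.0 / 50), n_tokens)

print(uniform_ppl, sharper_ppl)
```

Comparing values on documents held out from training is a common practice when choosing the number of topics. Whether that accounts for the behaviour in the question cannot be told from the excerpt.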

Using topic modeling Java toolkit

这一生的挚爱 posted on 2020-01-17 09:05:11
Question: I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is a set of (keyword, weight) pairs extracted from the news. I found two Java toolkits: MALLET and LingPipe. I've read the MALLET tutorial on importing data, and it takes plain text, not the format that I have. Is there any way I could change that? I also read a little about LingPipe; the example from the tutorial was using
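One possible workaround, sketched here in Python as a preprocessing step, is to turn each document's (keyword, weight) pairs into pseudo-text by repeating every keyword in proportion to its weight, then write one document per line as name, label, and text, which is the layout MALLET's file import is documented to read. The function names, the `scale` factor, and the sample weights are assumptions for illustration, and keywords are assumed to contain no whitespace.

```python
def pairs_to_text(pairs, scale=10):
    """Repeat each keyword in proportion to its weight (at least once)."""
    tokens = []
    for keyword, weight in pairs:
        repeats = max(1, round(weight * scale))
        tokens.extend([keyword] * repeats)
    return " ".join(tokens)

def to_mallet_line(doc_id, label, pairs, scale=10):
    """One instance per line: name, label, then the document text."""
    return f"{doc_id}\t{label}\t{pairs_to_text(pairs, scale)}"

# Hypothetical document; weights are assumed to lie in (0, 1].
doc = [("economy", 0.3), ("election", 0.1)]
line = to_mallet_line("doc1", "news", doc)
print(line)
```

Rounding the weights loses precision, so `scale` trades document length against fidelity to the original weights.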

Can use package interactively, but Rscript gives errors

ぐ巨炮叔叔 posted on 2020-01-15 06:03:44
Question: I'm using the "topicmodels" package in R. Everything works fine interactively, but if I run the exact same commands using Rscript, I get errors. The first error (which I solved) was that R didn't know what the is() function was. I solved this by importing the "methods" package. Apparently, Rscript doesn't import this automatically, even though interactive R does, and this caused the problem with is(). However, I am now stuck at a different error, which I can't figure out. I am

Spark LDA woes - prediction and OOM questions

无人久伴 posted on 2020-01-13 13:05:29
Question: I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which

Visualizing an LDA model, using Python

℡╲_俬逩灬. posted on 2020-01-13 08:27:08
Question: I have an LDA model with the 10 most common topics in 10K documents. Right now it's just an overview of the words with the corresponding probability distribution for each topic. I was wondering if there is something available for Python to visualize these topics?

Answer 1: pyLDAvis looks reasonably good. There's also Termite, developed by Jason Chuang of Stanford.

Answer 2: There are some visualizations you can choose from. In the topic of Visualizing topic models, the visualization could be implemented with D3 and Django
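Before setting up one of the tools named in the answers, a quick text rendering can already make the per-topic word probabilities easier to scan. This is a minimal standard-library sketch and not a substitute for pyLDAvis or Termite; the function name and the word probabilities are made up for illustration.

```python
def render_topic(topic_id, word_probs, top_n=5, width=40):
    """Return a text bar chart of the top_n most probable words in one topic."""
    top = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    largest = top[0][1]
    lines = [f"Topic {topic_id}"]
    for word, prob in top:
        # Bar length is relative to the most probable word in the topic.
        bar = "#" * max(1, round(width * prob / largest))
        lines.append(f"  {word:<12} {prob:.3f} {bar}")
    return "\n".join(lines)

# Made-up word probabilities for one topic.
topic0 = {"car": 0.20, "brakes": 0.10, "fast": 0.05, "slow": 0.02}
chart = render_topic(0, topic0)
print(chart)
```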

Plot the evolution of an LDA topic across time

元气小坏坏 posted on 2020-01-13 05:59:29
Question: I'd like to plot how the proportion of a particular topic changes over time, but I've been having trouble isolating a single topic and plotting it over time, especially when plotting multiple groups of documents separately (let's create two groups to compare: journals A and B). I've saved the dates associated with these journals in a function called dateConverter. Here's what I have so far (with much thanks to @scoa):

library(tm); library(topicmodels);
txtfolder <- "~/path/to/documents/"
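The aggregation behind such a plot can be sketched independently of the plotting library: take each document's topic proportions, pick one topic, and average it per journal and year. The Python sketch below assumes the per-document proportions are already available; the record layout, function name, and numbers are made up for illustration.

```python
from collections import defaultdict

def mean_topic_share(records, topic):
    """Average one topic's proportion per (journal, year).

    records: iterable of (journal, year, proportions), where proportions
    is the per-document topic distribution.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for journal, year, proportions in records:
        key = (journal, year)
        sums[key] += proportions[topic]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up per-document topic proportions for two journals.
records = [
    ("A", 2001, [0.7, 0.3]),
    ("A", 2001, [0.5, 0.5]),
    ("A", 2002, [0.2, 0.8]),
    ("B", 2001, [0.1, 0.9]),
]
shares = mean_topic_share(records, topic=0)

# One (year, share) series per journal, ready to pass to a plotting call.
series_a = sorted((year, share) for (journal, year), share in shares.items() if journal == "A")
print(series_a)
```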

TopicModel: How to query documents by topic model “topic”?

家住魔仙堡 posted on 2020-01-12 08:29:09
Question: Below is a fully reproducible example that computes the topic model for a given DataFrame.

import numpy as np
import pandas as pd

data = pd.DataFrame({'Body': ['Here goes one example sentence that is generic',
                              'My car drives really fast and I have no brakes',
                              'Your car is slow and needs no brakes',
                              'Your and my vehicle are both not as fast as the airplane']})

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer =
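Querying documents by topic usually comes down to the document-topic matrix, such as the one a fitted scikit-learn `LatentDirichletAllocation` returns from `transform`. A minimal sketch, assuming that matrix is already available as rows of per-topic shares; the function name, the `min_share` parameter, and the numbers are made up for illustration.

```python
def documents_for_topic(doc_topic, topic, min_share=0.0):
    """Indices of documents whose dominant topic is `topic`.

    doc_topic: one row per document, one column per topic.
    min_share: optionally require the dominant share to reach this value.
    """
    hits = []
    for index, row in enumerate(doc_topic):
        dominant = max(range(len(row)), key=row.__getitem__)
        if dominant == topic and row[topic] >= min_share:
            hits.append(index)
    return hits

# Made-up document-topic matrix for four documents and two topics.
doc_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.7, 0.3],
]
print(documents_for_topic(doc_topic, topic=1))
```

The returned indices can then be used to select the matching rows of the original DataFrame.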

IndexError while using Gensim package for LDA Topic Modelling

馋奶兔 posted on 2020-01-06 04:05:39
Question: I have a total of 54892 documents with 360331 unique tokens. The length of the dictionary is 88.

mm = corpora.MmCorpus('PRC.mm')
dictionary = corpora.Dictionary('PRC.dict')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=50, update_every=0, chunksize=19188, passes=650)

Whenever I run this script I get this error:

Traceback (most recent call last):
  File "C:\Users\modelDeTopics.py", line 19, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm,
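The numbers in the question (360331 unique tokens against a dictionary of length 88) suggest the corpus and the dictionary may not belong together, and a token id beyond the dictionary's range is one plausible cause of an IndexError. That diagnosis is an assumption, since the traceback is cut off. The check below is a plain-Python sketch over a bag-of-words corpus of (token_id, count) pairs; the function names and the sample corpus are made up.

```python
def max_token_id(corpus):
    """Largest token id in a bag-of-words corpus of (token_id, count) pairs."""
    return max(
        (token_id for document in corpus for token_id, _ in document),
        default=-1,
    )

def ids_fit_dictionary(corpus, dictionary_size):
    """True when every token id can be looked up in the dictionary."""
    return max_token_id(corpus) < dictionary_size

# Made-up corpus: two documents, token ids up to 120.
corpus = [[(0, 2), (5, 1)], [(120, 3)]]
print(ids_fit_dictionary(corpus, dictionary_size=88))
```

If the check fails, the dictionary passed as `id2word` is probably not the one the corpus was built with.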

IndexError while using Gensim package for LDA Topic Modelling

拥有回忆 posted on 2020-01-06 04:04:06
Question: I have a total of 54892 documents with 360331 unique tokens. The length of the dictionary is 88.

mm = corpora.MmCorpus('PRC.mm')
dictionary = corpora.Dictionary('PRC.dict')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=50, update_every=0, chunksize=19188, passes=650)

Whenever I run this script I get this error:

Traceback (most recent call last):
  File "C:\Users\modelDeTopics.py", line 19, in <module>
    lda = gensim.models.ldamodel.LdaModel(corpus=mm,