topic-modeling

Error installing topicmodels in R on Ubuntu

筅森魡賤 submitted on 2019-12-12 08:52:39
Question: I am getting an error while installing the topicmodels package in R. On running install.packages("topicmodels", dependencies=TRUE), the following are the last few lines of output I get. Please help. My R version is 3.1.3.

g++ -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c utilities.cpp -o utilities.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector --param=ssp
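The output above is cut off before the actual error, but topicmodels lists the GNU Scientific Library (GSL) as a system requirement, and a missing GSL development package is the most common cause of this compile failure on Ubuntu. A hedged fix, assuming the truncated log ends in a GSL-related error:

```shell
# Assumption: the missing dependency is GSL (the usual culprit for
# topicmodels build failures on Ubuntu). Install the dev headers:
sudo apt-get update
sudo apt-get install libgsl0-dev   # on newer Ubuntu releases the package is libgsl-dev
```

After the headers are installed, rerun install.packages("topicmodels", dependencies=TRUE) inside R. If the build still fails, the real error will be in the lines just before the compiler output stops.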

Python: clustering similar words based on word2vec

一曲冷凌霜 submitted on 2019-12-12 04:54:20
Question: This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2Vec model. The code is as below:

site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
site.download()
site.parse()

def clean(doc):
    stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word)
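Once a Word2Vec model is trained, a common way to group similar words is to cluster their vectors by cosine similarity. Below is a dependency-free sketch: the toy 2-D vectors stand in for real embeddings, which in gensim 4.x you would look up with model.wv[word] (an assumed API call, not shown running here).

```python
from math import sqrt

def cosine(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_clusters(vectors, threshold=0.9):
    """Greedily assign each word to the first cluster whose seed word is
    more similar than `threshold`; otherwise start a new cluster."""
    clusters = []  # list of (seed_word, [member_words])
    for word, vec in vectors.items():
        for seed, members in clusters:
            if cosine(vectors[seed], vec) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((word, [word]))
    return [members for _, members in clusters]

# Toy stand-ins; with a trained model these would be model.wv[word].
toy = {
    "king":  [0.90, 0.10],
    "queen": [0.85, 0.15],
    "apple": [0.10, 0.90],
    "pear":  [0.15, 0.85],
}
print(greedy_clusters(toy, threshold=0.95))  # [['king', 'queen'], ['apple', 'pear']]
```

For real embeddings, KMeans over model.wv.vectors (e.g. via scikit-learn) is the usual heavier-weight alternative to this greedy pass.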

Memory efficient LDA training using gensim library

社会主义新天地 submitted on 2019-12-12 02:58:02
Question: Today I started writing a script which trains LDA models on large corpora (minimum 30M sentences) using the gensim library. Here is the current code that I am using:

from gensim import corpora, models, similarities, matutils

def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    corpus = [dictionary.doc2bow(line.lower().split()) for line
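The memory problem in code like this is usually the list comprehension: it materializes the bag-of-words for all 30M sentences in RAM at once. gensim only needs a re-iterable object, so the corpus can be streamed one line at a time. A stdlib-only sketch of that pattern (the gensim calls at the bottom are the assumed usage, shown as comments):

```python
class StreamingCorpus:
    """Re-iterable corpus that converts one line at a time to bag-of-words,
    instead of building the whole list in memory."""

    def __init__(self, fname, token2id):
        self.fname = fname        # path to the text file, one document per line
        self.token2id = token2id  # word -> integer id mapping

    def __iter__(self):
        with open(self.fname) as fh:
            for line in fh:
                counts = {}
                for tok in line.lower().split():
                    if tok in self.token2id:
                        tid = self.token2id[tok]
                        counts[tid] = counts.get(tid, 0) + 1
                yield sorted(counts.items())  # [(token_id, count), ...]

# Assumed gensim usage (not run here):
#   dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
#   corpus = StreamingCorpus(fname, dictionary.token2id)
#   lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
```

Because __iter__ reopens the file each pass, gensim can make the multiple passes LDA training needs while peak memory stays at roughly one document.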

Topic Modelling - Assign human readable labels to topic

落花浮王杯 submitted on 2019-12-11 11:54:43
Question: I want to assign human-readable labels to the results of my topic modelling. Is there any software library or data set I can use that takes these keywords as input and returns a title describing the topic?

Example:
Input: ["Church", "Priest", "God", "Prayer"]
Output: "Religion"

Note: I want automatic label creation, not manual labelling as others have asked about before.

Answer 1: See this paper by Jey Han Lau. He describes how to automatically generate labels using different sources and features. We
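The core of such approaches is ranking candidate labels against the topic's top words. As a heavily simplified, stdlib-only sketch: score each candidate label by word overlap with the topic. The hand-made CANDIDATE_LABELS map is purely illustrative; Lau et al. instead mine candidates from external sources (e.g. Wikipedia titles) and rank them with richer features.

```python
# Hypothetical label inventory, for illustration only.
CANDIDATE_LABELS = {
    "Religion": {"church", "priest", "god", "prayer", "faith"},
    "Sports":   {"ball", "team", "score", "coach"},
}

def best_label(topic_words):
    """Pick the candidate label whose word set overlaps the topic most."""
    words = {w.lower() for w in topic_words}
    return max(CANDIDATE_LABELS, key=lambda lbl: len(words & CANDIDATE_LABELS[lbl]))

print(best_label(["Church", "Priest", "God", "Prayer"]))  # Religion
```

Overlap counting is a crude proxy; swapping it for embedding similarity between the label and the topic words is the natural next step.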

How to implement Latent Dirichlet Allocation in regression analysis

╄→гoц情女王★ submitted on 2019-12-11 05:26:13
Question: I have a dataset consisting of hotel reviews, ratings, and other features such as traveler type and the word count of the review. I want to perform topic modeling (LDA) and use the topics derived from the reviews, as well as the other features, to identify which features most affect the ratings (with ratings as the dependent variable). If I want to use linear regression to do this, does this mean I would have to label each review with the topics derived? Is there a way to do this in R or will I have to
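The usual recipe is not to hard-label each review with one topic, but to attach its per-document topic proportions as numeric columns. In R with topicmodels that would be posterior(lda_model)$topics bound onto the data frame and passed to lm(). A stdlib Python toy, just to show the shape of the data (one topic proportion regressed against ratings; the numbers are made up):

```python
def simple_ols(x, y):
    """Closed-form simple linear regression: y = alpha + beta * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    alpha = my - beta * mx
    return alpha, beta

# Toy data: proportion of a hypothetical "cleanliness" topic per review
# (from the document-topic matrix) vs. that review's rating.
topic_prop = [0.1, 0.4, 0.5, 0.9]
ratings    = [2.0, 3.0, 3.5, 5.0]
alpha, beta = simple_ols(topic_prop, ratings)
# A positive beta suggests reviews heavier in this topic get higher ratings.
```

With all K topic columns plus traveler type and word count, the same idea becomes an ordinary multiple regression, so no manual topic labelling is needed.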

How to use Topic Model (LDA) output to match and retrieve new, same-topic documents

北慕城南 submitted on 2019-12-10 11:51:54
Question: I am using an LDA model on a corpus to learn the topics covered in it. I am using the gensim package (e.g., gensim.models.ldamodel.LdaModel); I can easily use other versions of LDA if necessary. My question is: what is the most efficient way to use the parameterized model and/or the topic words or topic IDs to find and retrieve new documents that contain the topic? Concretely, I want to scrape a media API to find new articles (out-of-sample documents) that relate to my topics contained in my original
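Two routes are common: infer the full topic mixture of each new document (in gensim, lda[dictionary.doc2bow(tokens)]), or the cheaper filter sketched below, which scores a document by how much probability the target topic assigns to its words. The topic's word distribution would come from something like lda.show_topic(topic_id, topn=...) in gensim; the dictionary and threshold here are illustrative.

```python
def topic_score(doc_tokens, topic_word_probs):
    """Average probability the topic assigns to the document's words.
    `topic_word_probs` maps word -> P(word | topic)."""
    if not doc_tokens:
        return 0.0
    return sum(topic_word_probs.get(t, 0.0) for t in doc_tokens) / len(doc_tokens)

# Illustrative top words of one learned topic.
topic = {"election": 0.12, "vote": 0.10, "party": 0.08}

new_doc = "the party will vote in the election".split()
score = topic_score(new_doc, topic)
# Keep the scraped article if the score clears a tuned threshold:
is_match = score > 0.02
```

This keyword-probability filter is fast enough to run over an API firehose; documents that pass can then get the full (slower) posterior inference for a precise topic mixture.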

Latent Dirichlet Allocation with prior topic words

穿精又带淫゛_ submitted on 2019-12-09 21:18:16
Question: Context: I'm trying to extract topics from a set of texts using Latent Dirichlet Allocation from Scikit-Learn's decomposition module. This works really well, except for the quality of the topic words found/selected. In an article, Li et al. (2017) describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words associated with these topics. For these words they set the default value to a high number for the associated topic and 0 for the
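Scikit-learn's LatentDirichletAllocation only accepts a scalar topic_word_prior, so a per-word seeded prior like the one described is easier in gensim, whose LdaModel accepts eta as a full (num_topics x num_terms) matrix. A stdlib sketch of building such a seed matrix (the boost/base values and seed words are illustrative, mirroring the manual choices in Li et al.):

```python
def build_eta(seed_words, token2id, num_topics, vocab_size,
              boost=0.9, base=0.01):
    """Build a (num_topics x vocab_size) prior matrix: seed words get a
    high prior in their assigned topic, all other entries a small
    symmetric prior. `seed_words` maps topic index -> list of words."""
    eta = [[base] * vocab_size for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for w in words:
            if w in token2id:                 # ignore seeds outside the vocab
                eta[topic][token2id[w]] = boost
    return eta

token2id = {"church": 0, "priest": 1, "ball": 2, "team": 3}
eta = build_eta({0: ["church", "priest"], 1: ["ball", "team"]},
                token2id, num_topics=2, vocab_size=4)

# Assumed gensim usage (not run here):
#   models.LdaModel(corpus, num_topics=2, id2word=id2word, eta=eta)
```

Note the paper sets non-seed entries to 0; a small positive base is used here instead, since a hard zero forbids the word in that topic entirely rather than merely discouraging it.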

R Supervised Latent Dirichlet Allocation Package

我们两清 submitted on 2019-12-06 18:29:27
Question: I'm using this LDA package for R. Specifically, I am trying to do supervised latent Dirichlet allocation (sLDA). In the linked package there's an slda.em function. What confuses me is that it asks for alpha, eta, and variance parameters. As far as I understand, these parameters are unknowns in the model. So my question is: did the author of the package mean that these are initial guesses for the parameters? If yes, there doesn't seem to be a way of accessing them from

How to evaluate the best K for LDA using Mallet?

给你一囗甜甜゛ submitted on 2019-12-06 09:55:04
Question: I am using the Mallet API to extract topics from Twitter data, and the topics I have extracted already look good. But I am having trouble estimating K. For example, I fixed K at values from 10 to 100, so I have extracted different numbers of topics from the data; now I would like to estimate which K is best. Some algorithms I know of are:

Perplexity
Empirical likelihood
Marginal likelihood (harmonic mean method)
Silhouette

I found a method model.estimate() which may be used to
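Whichever metric is used, the selection loop is the same: train a model per candidate K on a training split, score each on held-out data, and keep the best-scoring K. A minimal stdlib sketch of that final step (the perplexity values below are made up; in practice they would come from Mallet's held-out evaluator or, in gensim, log_perplexity):

```python
def pick_k(perplexities):
    """Choose the K with the lowest held-out perplexity.
    `perplexities` maps candidate K -> perplexity on a held-out set."""
    return min(perplexities, key=perplexities.get)

# Illustrative (fabricated) scores from the K = 10..100 sweep:
scores = {10: 1500.0, 25: 1210.0, 50: 1180.0, 75: 1240.0, 100: 1330.0}
best_k = pick_k(scores)  # 50
```

Held-out perplexity tends to keep improving slowly with K, so in practice people often take the elbow of the curve, or a coherence score, rather than the strict minimum.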

Trying to remove words from a DocumentTermMatrix in order to use topicmodels

自作多情 submitted on 2019-12-06 05:56:15
Question: I am trying to use the topicmodels package for R (100 topics on a corpus of ~6,400 documents, each of ~1,000 words). The process runs and then dies, I think because it is running out of memory. So I am trying to shrink the size of the document-term matrix that the lda() function takes as input; I figure I can do that using the minDocFreq option when I generate my document-term matrices. But when I use it, it doesn't seem to make any difference. Here is some code: Here is the relevant
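If minDocFreq has no visible effect, two alternatives in R's tm package that do shrink the matrix are removeSparseTerms(dtm, 0.99) and the bounds control, e.g. DocumentTermMatrix(corpus, control = list(bounds = list(global = c(5, Inf)))). The underlying idea, document-frequency pruning, can be sketched dependency-free in Python:

```python
from collections import Counter

def prune_vocab(docs, min_df=2):
    """Keep only terms appearing in at least `min_df` documents, which
    shrinks the document-term matrix before topic modelling."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vocab = {t for t, n in df.items() if n >= min_df}
    return [[t for t in doc if t in vocab] for doc in docs]

docs = [["apple", "pie"], ["apple", "cake"], ["tart"]]
pruned = prune_vocab(docs, min_df=2)   # only "apple" appears in >= 2 docs
```

Dropping terms that occur in only one or two documents typically removes a large share of the vocabulary (rare words dominate the long tail) while barely affecting the learned topics, which is exactly what a memory-starved 100-topic run needs.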