topic-modeling

Topic modelling, but with known topics?

≡放荡痞女 submitted on 2019-12-03 13:56:36
Question: Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA framework, as there are very good tools available to solve LDA problems. For the sake of being thorough, I have the following pieces of information as input:

- A set of documents (segments of DNA from one organism, where each segment is a document)
- A document can only have one topic in this scenario
- A set of topics (segments of DNA from other organisms)
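Since each document has exactly one known topic, one option worth noting alongside the LDA framing is a naive-Bayes-style likelihood comparison: estimate a smoothed k-mer ("word") distribution from each known topic's segments, then assign each document to the topic under which it is most probable. This is a minimal sketch, not the LDA machinery the question asks about; the k-mer tokenization, k=4, and all names are illustrative assumptions.

    from collections import Counter
    import math

    def kmers(seq, k=4):
        # Tokenize a DNA string into overlapping k-mers ("words").
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def topic_distribution(segments, k=4, alpha=1.0):
        # Laplace-smoothed unigram distribution over k-mers for one known topic.
        counts = Counter()
        for s in segments:
            counts.update(kmers(s, k))
        total = sum(counts.values())
        vocab = 4 ** k  # number of possible DNA k-mers, for smoothing
        return lambda w: (counts[w] + alpha) / (total + alpha * vocab)

    def classify(document, topics, k=4):
        # Assign the single best topic by log-likelihood (one topic per doc).
        best, best_ll = None, float('-inf')
        for name, p in topics.items():
            ll = sum(math.log(p(w)) for w in kmers(document, k))
            if ll > best_ll:
                best, best_ll = name, ll
        return best

    # Hypothetical usage:
    topics = {
        'organismA': topic_distribution(['ACGTACGTAA', 'CGTACGTACG']),
        'organismB': topic_distribution(['TTTTGGGGCC', 'GGCCTTTTGG']),
    }
    print(classify('ACGTACGTAC', topics))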

Gensim: KeyError: “word not in vocabulary”

佐手、 submitted on 2019-12-03 11:43:11
Question: I have a trained Word2vec model using Python's Gensim library. I have a tokenized list as below. The vocab size is 34, but I am just giving a few of the 34:

    b = ['let', 'know', 'buy', 'someth', 'featur', 'mashabl', 'might', 'earn', 'affili', 'commiss', 'fifti', 'year', 'ago', 'graduat', '21yearold', 'dustin', 'hoffman', 'pull', 'asid', 'given', 'one', 'piec', 'unsolicit', 'advic', 'percent', 'buy']

Model:

    model = gensim.models.Word2Vec(b, min_count=1, size=32)
    print(model)  ### prints: Word2Vec
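The usual cause of this KeyError (an assumption about the asker's setup, but consistent with the snippet) is that Word2Vec expects an iterable of sentences, i.e. a list of token lists. Passing the flat list b makes gensim iterate each string character by character, so the learned vocabulary consists of single letters, and lookups like model.wv['buy'] fail. Wrapping b in a list is a minimal sketch of a fix (size= is the gensim 3.x parameter name; gensim 4.x renamed it to vector_size=):

    import gensim

    b = ['let', 'know', 'buy', 'someth', 'featur']  # shortened token list

    # Word2Vec wants a list of sentences; each sentence is a list of tokens.
    sentences = [b]

    model = gensim.models.Word2Vec(sentences, min_count=1, size=32)
    print(model.wv['buy'])  # now a 32-dimensional vector, no KeyError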

How do I print lda topic model and the word cloud of each of the topics

爷,独闯天下 submitted on 2019-12-03 11:03:58
Question:

    from nltk.tokenize import RegexpTokenizer
    from stop_words import get_stop_words
    from gensim import corpora, models
    import gensim
    import os
    from os import path
    from time import sleep
    import matplotlib.pyplot as plt
    import random
    from wordcloud import WordCloud, STOPWORDS

    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = set(get_stop_words('en'))
    with open(os.path.join('c:\users\kaila\jobdescription.txt')) as f:
        Reader = f.read()
    Reader = Reader.replace("will", " ")
    Reader = Reader.replace("please", " ")
    texts = unicode(Reader, errors='replace')
    tdm = []
    raw = texts.lower()
    tokens = tokenizer
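The excerpt cuts off before the modelling step, but a minimal sketch of the two things the title asks for, printing the topics and drawing one word cloud per topic, could look like the following with gensim and wordcloud. The tiny texts_tokenized corpus, num_topics=2, and topn=30 are assumptions; generate_from_frequencies takes a word-to-weight dict in recent wordcloud releases.

    from gensim import corpora, models
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # texts_tokenized: one token list per document (assumed prepared earlier)
    texts_tokenized = [['data', 'science', 'python'],
                       ['java', 'developer', 'spring']]

    dictionary = corpora.Dictionary(texts_tokenized)
    corpus = [dictionary.doc2bow(t) for t in texts_tokenized]
    lda = models.ldamodel.LdaModel(corpus, num_topics=2,
                                   id2word=dictionary, passes=10)

    # Print the topics as weighted word lists.
    for topic_id, words in lda.print_topics(num_words=5):
        print(topic_id, words)

    # One word cloud per topic, built from that topic's word probabilities.
    for topic_id in range(lda.num_topics):
        freqs = dict(lda.show_topic(topic_id, topn=30))
        wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
        plt.figure()
        plt.imshow(wc, interpolation='bilinear')
        plt.axis('off')
        plt.title('Topic {}'.format(topic_id))
    plt.show()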

Simple Python implementation of collaborative topic modeling?

北城余情 submitted on 2019-12-03 03:34:19
Question: I came across these two papers, which combine collaborative filtering (matrix factorization) and topic modelling (LDA) to recommend users similar articles/posts based on the topic terms of the posts/articles the users are interested in. The papers (in PDF) are: "Collaborative Topic Modeling for Recommending Scientific Articles" and "Collaborative Topic Modeling for Recommending GitHub Repositories". The new algorithm is called collaborative topic regression. I was hoping to find some Python code that implemented this, but to no avail. This might be a long shot, but can someone show a simple Python
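No canonical Python implementation is referenced in the excerpt; the following is a deliberately small numpy sketch of the core idea of collaborative topic regression under simplifying assumptions (topic proportions theta are taken as given, e.g. from a separately trained LDA model; matrices are dense; the function name ctr_als and all hyperparameters are illustrative). Item factors are regularized toward their LDA topic proportions while user factors are fit by alternating least squares.

    import numpy as np

    def ctr_als(R, theta, lambda_u=0.1, lambda_v=10.0, n_iters=20):
        # Toy collaborative topic regression.
        # R:     (n_users, n_items) ratings, 0 = unobserved, treated as weak 0s
        # theta: (n_items, k) LDA topic proportions per item
        # lambda_v pulls item factors V toward theta (the CTR idea).
        n_users, n_items = R.shape
        k = theta.shape[1]
        U = np.random.rand(n_users, k) * 0.1
        V = theta.copy()
        I = np.eye(k)
        for _ in range(n_iters):
            # Update users: ridge regression against current item factors.
            A = V.T @ V + lambda_u * I
            U = np.linalg.solve(A, V.T @ R.T).T
            # Update items: like plain MF, but shrunk toward theta, not zero.
            B = U.T @ U + lambda_v * I
            V = np.linalg.solve(B, U.T @ R + lambda_v * theta.T).T
        return U, V

    # Hypothetical usage with random data:
    rng = np.random.default_rng(0)
    theta = rng.dirichlet(np.ones(5), size=20)         # 20 items, 5 topics
    R = (rng.random((10, 20)) > 0.8).astype(float)     # 10 users, binary "likes"
    U, V = ctr_als(R, theta)
    print(U @ V.T)  # predicted preference scores

The full model in the papers also weights observed versus unobserved entries with per-pair confidences; this sketch drops that for brevity.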

Run cvb in mahout 0.8

≡放荡痞女 submitted on 2019-12-03 01:48:23
The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version of topic modelling and removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized far better. Unfortunately, there is documentation only for lda on how to run an example and generate meaningful output. Thus, I want to:

- preprocess some texts correctly
- run the cvb0_local version of cvb
- inspect the results by looking at the top n words in each of the generated topics

So here are the subsequent Mahout commands I had to call in a Linux shell to do it. $MAHOUT_HOME points to my mahout/bin

how to get a probability distribution for a topic in mallet?

自古美人都是妖i submitted on 2019-12-02 19:31:16
Question: Using MALLET I can get a specific number of topics and their words. How can I make sure the topic words form a probability distribution (i.e. sum to one)? For example, if I run it as below, how can I use the outputs given by MALLET to make sure the probabilities of the topic words for topic 0 add up to 1?

    mallet train-topics --input text.vectors --output-topic-keys topics.txt --output-doc-topics doc_comp.txt --topic-word-weights-file weights.txt --num-top-words 50 --word-topic-counts-file counts.txt --num
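One post-hoc approach (a sketch, assuming weights.txt contains tab-separated topic, word, weight lines, which is the usual shape of MALLET's --topic-word-weights-file output) is to normalize each topic's weights so they sum to 1:

    from collections import defaultdict

    # weight per (topic, word), read from MALLET's weights file
    weights = defaultdict(dict)
    with open('weights.txt') as f:
        for line in f:
            topic, word, weight = line.rstrip('\n').split('\t')
            weights[int(topic)][word] = float(weight)

    # Normalize each topic so its word weights form a probability distribution.
    probs = {}
    for topic, word_weights in weights.items():
        total = sum(word_weights.values())
        probs[topic] = {w: v / total for w, v in word_weights.items()}

    print(sum(probs[0].values()))  # -> 1.0, up to float rounding

Note that the file's weights typically include the beta smoothing prior added to each count, so the normalized values are the smoothed topic distribution rather than raw count proportions.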

Memory error in python using numpy array

∥☆過路亽.° submitted on 2019-12-02 11:11:15
I am getting the following error for this code:

    model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
    model.fit(X)
    topic_word = model.topic_word_
    print("type(topic_word): {}".format(type(topic_word)))
    print("shape: {}".format(topic_word.shape))
    print("\n")
    n = 15
    doc_topic = model.doc_topic_
    for i in range(15):
        print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax()))
        topic_csharp = np.zeros(shape=[1, n])
        np.copyto(topic_csharp, doc_topic[0][i])
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
        print('*Topic {}\n- {}'
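Two likely issues stand out (assumptions, since the full traceback is not shown in the excerpt): doc_topic[0][i] picks a single probability out of document 0's row rather than document i's topic distribution, and the lda package will happily accept a dense document-term matrix, which is what usually triggers a MemoryError on large vocabularies; feeding it a scipy sparse matrix is the usual remedy. A corrected sketch with placeholder data:

    import numpy as np
    import lda
    from scipy import sparse

    # X: document-term count matrix; keeping it sparse avoids the MemoryError
    # that a dense numpy array often causes on large vocabularies.
    X = sparse.random(100, 5000, density=0.01, format='csr', random_state=1)
    X = (X * 100).astype(np.int64)  # lda expects non-negative integer counts
    vocab = ['word{}'.format(j) for j in range(5000)]   # placeholder vocabulary
    titles = ['doc{}'.format(i) for i in range(100)]    # placeholder titles

    model = lda.LDA(n_topics=15, n_iter=50, random_state=1)
    model.fit(X)

    doc_topic = model.doc_topic_   # shape: (n_docs, n_topics)
    for i in range(15):
        # doc_topic[i] is document i's full topic distribution; the original
        # code's doc_topic[0][i] was a single probability from document 0.
        print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))

    n = 15
    for i, topic_dist in enumerate(model.topic_word_):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n + 1):-1]
        print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))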