Question
I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. I was wondering if you know of a source (or care enough to make a small example) that shows how GSDMM is implemented using Python.
Answer 1:
GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text clustering model. It is essentially a modified LDA (Latent Dirichlet Allocation) that assumes each document, such as a tweet or any other short text, covers exactly one topic.
Repository: github.com/da03/GSDMM
import math

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find


class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta

    def fit(self, X):
        """Cluster the documents in X, a (documents x vocabulary) count matrix."""
        alpha = self.alpha
        beta = self.beta

        D, V = X.shape
        K = self.n_topics

        N_d = np.asarray(X.sum(axis=1)).flatten()  # length of each document
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d, :])[1]  # indices of the words present in document d

        # initialization: assign every document to a random cluster
        N_k = np.zeros(K)                           # number of words per cluster
        M_k = np.zeros(K)                           # number of documents per cluster
        N_k_w = lil_matrix((K, V), dtype=np.int32)  # per-cluster word counts
        K_d = np.zeros(D, dtype=np.int32)           # cluster assignment of each document

        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0 / K] * K)[0]
            K_d[d] = k
            M_k[k] += 1
            N_k[k] += N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] += X[d, w]

        for it in range(self.n_iter):
            print('iter', it)
            for d in range(D):
                # remove document d from its current cluster
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d, w]

                # compute the conditional log-probability of each cluster
                log_probs = [0] * K
                for k in range(K):
                    log_probs[k] += math.log(alpha + M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d, w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k, w] + beta + j)
                    for i in range(int(N_d[d])):
                        log_probs[k] -= math.log(N_k[k] + beta * V + i)

                # normalize in log space to avoid underflow, then sample k_new
                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs / np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]

                # add document d to the sampled cluster
                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d, w]

        self.topic_word_ = N_k_w.toarray()
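For quick testing, here is a minimal usage sketch (my own addition, not from the repo): it assumes X is a small dense document-term count matrix with one row per tweet. Note that the per-document assignments K_d are computed inside fit but not stored, so keeping them would be a one-line extension if you need a label per tweet.

import numpy as np

# Toy document-term count matrix: 4 "tweets" over a 5-word vocabulary.
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 2, 1, 1]])

model = GSDMM(n_topics=2, n_iter=10)
model.fit(X)
print(model.topic_word_)  # (n_topics x vocabulary) word counts per cluster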
Answer 2:
I am experimenting with GSDMM as well and ran into the same problem: there is just not much material online (I was unable to find more than you did, aside from a few papers that use it). If you look at the code of the GSDMM GitHub repo, you can see that it is a pretty small repository with only a few functions, and these are basically all used in the Towards Data Science tutorial, so I don't think you are missing anything.
If you have a specific question, feel free to ask!
Edit: If you follow the tutorial on Towards Data Science, you will realize that it is an inconsistent and unfinished project: some helper functions are missing and the algorithm is not used correctly. The author runs it with K=10 and ends up with exactly 10 clusters. If you increase K (and you should), the number of clusters found comes out higher than 10, so there is a little bit of cheating happening.
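To see this for yourself, here is a minimal sketch (assuming the gsdmm package's MovieGroupProcess, with docs and n_terms prepared as in Answer 3 below): start from a deliberately large K and count how many clusters are still non-empty after fitting.

import numpy as np
from gsdmm import MovieGroupProcess

# Start with more clusters than you expect; GSDMM empties out the unneeded ones.
mgp = MovieGroupProcess(K=40, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, n_terms)

# Count the clusters that still hold at least one document after fitting.
non_empty = np.count_nonzero(mgp.cluster_doc_count)
print('Clusters actually used:', non_empty)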
Answer 3:
I finally put together my GSDMM code and will share it here from scratch for others to use. I have tried to comment the important parts:
import random

import gensim
import numpy as np
import spacy
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess
from nltk.corpus import stopwords

# `data` is assumed to be a list of raw tweet strings;
# any spaCy English model works for the lemmatization step below.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = stopwords.words('english')

# turning sentences into lists of words
data_words = []
for doc in data:
    data_words.append(doc.split())

# building bi-grams
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
print('done!')

# removing stop words
stop_words.extend(['from', 'rt'])

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

data_words_nostops = remove_stopwords(data_words)

# form bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]

# lemmatization, keeping only content words
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append([token.lemma_ for token in doc
                            if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']])

docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
n_docs = len(docs)

# train a new model
random.seed(1000)

# init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

# fit the model on the data
y = mgp.fit(docs, n_terms)

def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(),
                            key=lambda kv: kv[1], reverse=True)[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print(' - - - - - - - - - ')

doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*' * 20)

# topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*' * 20)

# show the top 10 words by term frequency for each cluster
top_words(mgp.cluster_word_distribution, top_index, 10)
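Since the original question was about assigning topics to individual tweets: the labels returned by fit can be used directly, and (as a hedged sketch, assuming the same mgp object) the gsdmm package also exposes choose_best_label, which returns the most likely cluster for a single tokenized document together with its score.

# `y` already holds one cluster label per document from mgp.fit.
for doc, label in zip(docs[:5], y[:5]):
    print(label, doc)

# Score a single tokenized document against the fitted model.
topic, score = mgp.choose_best_label(docs[0])
print('Tweet 0 -> cluster %d (score %.2f)' % (topic, score))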
Hope this helps!
Source: https://stackoverflow.com/questions/62108771/a-practical-example-of-gsdmm-in-python