Question
I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. I was wondering if you know of a source (or care enough to make a small example) that shows how GSDMM is implemented using Python.
Answer 1:
GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text clustering model. It is essentially a modified LDA (Latent Dirichlet Allocation) that assumes each document, such as a tweet or any other short text, covers exactly one topic.
Repository: github.com/da03/GSDMM
import math

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find


class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta

    def fit(self, X):
        """Cluster the documents in X, a (documents x vocabulary) count matrix."""
        alpha = self.alpha
        beta = self.beta

        D, V = X.shape
        K = self.n_topics

        N_d = np.asarray(X.sum(axis=1)).flatten()  # length of each document
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d, :])[1]  # indices of the words present in document d

        # initialization: assign every document to a random cluster
        N_k = np.zeros(K)                           # number of words per cluster
        M_k = np.zeros(K)                           # number of documents per cluster
        N_k_w = lil_matrix((K, V), dtype=np.int32)  # per-cluster word counts
        K_d = np.zeros(D, dtype=np.int32)           # cluster assignment of each document

        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0 / K] * K)[0]
            K_d[d] = k
            M_k[k] += 1
            N_k[k] += N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] += X[d, w]

        for it in range(self.n_iter):
            print('iter', it)
            for d in range(D):
                # remove document d from its current cluster
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d, w]

                # compute the conditional log-probability of each cluster
                log_probs = [0] * K
                for k in range(K):
                    log_probs[k] += math.log(alpha + M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d, w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k, w] + beta + j)
                    for i in range(int(N_d[d])):
                        log_probs[k] -= math.log(N_k[k] + beta * V + i)

                # normalize in log space to avoid underflow, then sample k_new
                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs / np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]

                # add document d to the sampled cluster
                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d, w]

        self.topic_word_ = N_k_w.toarray()
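For quick testing, here is a minimal usage sketch (my own addition, not from the repo): it assumes X is a small dense document-term count matrix with one row per tweet. Note that the per-document assignments K_d are computed inside fit but not stored, so keeping them would be a one-line extension if you need a label per tweet.

import numpy as np

# Toy document-term count matrix: 4 "tweets" over a 5-word vocabulary.
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 2, 1, 1]])

model = GSDMM(n_topics=2, n_iter=10)
model.fit(X)
print(model.topic_word_)  # (n_topics x vocabulary) word counts per cluster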
Answer 2:
I am experimenting with GSDMM as well and ran into the same problem: there is just not much material online (I was unable to find more than you did, aside from a few papers that use it). If you look at the code of the GSDMM GitHub repo, you can see that it is a pretty small repository with only a few functions, and these are basically all used in the Towards Data Science tutorial, so I don't think you are missing anything.
If you have a specific question, feel free to ask!
Edit: If you follow the tutorial on Towards Data Science, you will realize that it is an inconsistent and unfinished project: some helper functions are missing and the algorithm is not used correctly. The author runs it with K=10 and ends up with exactly 10 clusters. If you increase K (and you should), the number of clusters found comes out higher than 10, so there is a little bit of cheating happening.
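To see this for yourself, here is a minimal sketch (assuming the gsdmm package's MovieGroupProcess, with docs and n_terms prepared as in Answer 3 below): start from a deliberately large K and count how many clusters are still non-empty after fitting.

import numpy as np
from gsdmm import MovieGroupProcess

# Start with more clusters than you expect; GSDMM empties out the unneeded ones.
mgp = MovieGroupProcess(K=40, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, n_terms)

# Count the clusters that still hold at least one document after fitting.
non_empty = np.count_nonzero(mgp.cluster_doc_count)
print('Clusters actually used:', non_empty)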
Answer 3:
I finally put together my GSDMM code and will share it here from scratch for others to use. I have tried to comment the important parts:
import random

import gensim
import numpy as np
import spacy
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess
from nltk.corpus import stopwords

# `data` is assumed to be a list of raw tweet strings;
# any spaCy English model works for the lemmatization step below.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = stopwords.words('english')

# turning sentences into lists of words
data_words = []
for doc in data:
    data_words.append(doc.split())

# building bi-grams
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
print('done!')

# removing stop words
stop_words.extend(['from', 'rt'])

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

data_words_nostops = remove_stopwords(data_words)

# form bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]

# lemmatization, keeping only content words
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append([token.lemma_ for token in doc
                            if token.pos_ in ['NOUN', 'ADJ', 'VERB', 'ADV']])

docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
n_docs = len(docs)

# train a new model
random.seed(1000)

# init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

# fit the model on the data
y = mgp.fit(docs, n_terms)

def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(),
                            key=lambda kv: kv[1], reverse=True)[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print(' - - - - - - - - - ')

doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*' * 20)

# topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*' * 20)

# show the top 10 words by term frequency for each cluster
top_words(mgp.cluster_word_distribution, top_index, 10)
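Since the original question was about assigning topics to individual tweets: the labels returned by fit can be used directly, and (as a hedged sketch, assuming the same mgp object) the gsdmm package also exposes choose_best_label, which returns the most likely cluster for a single tokenized document together with its score.

# `y` already holds one cluster label per document from mgp.fit.
for doc, label in zip(docs[:5], y[:5]):
    print(label, doc)

# Score a single tokenized document against the fitted model.
topic, score = mgp.choose_best_label(docs[0])
print('Tweet 0 -> cluster %d (score %.2f)' % (topic, score))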
Hope this helps!
Source: https://stackoverflow.com/questions/62108771/a-practical-example-of-gsdmm-in-python