I am trying to obtain the optimal number of topics for an LDA-model within Gensim. One method I found is to calculate the log likelihood for each model and compare each agai
Although I cannot comment on Gensim in particular, I can weigh in with some general advice for optimising your topics.
As you stated, using log likelihood is one method. Another option is to hold out a set of documents from the model generation process, infer topics over them once the model is complete, and check whether the results make sense.
A completely different method you could try is a hierarchical Dirichlet process, which can find the number of topics in the corpus dynamically without it being specified in advance.
There are many papers on how best to specify parameters and evaluate your topic model. Depending on your experience level, these may or may not be useful to you:
Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.
Evaluation Methods for Topic Models, Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D.
Also, here is the paper about the hierarchical Dirichlet process:
Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
A general rule of thumb is to create LDA models across different topic numbers, and then check the Jaccard similarity and coherence for each. Coherence here scores a single topic by the degree of semantic similarity between its high-scoring words (that is, whether these words co-occur across the text corpus). The following will give a strong intuition for the optimal number of topics. It should serve as a baseline before jumping to the hierarchical Dirichlet process, as that technique has been found to have issues in practical applications.
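To make the co-occurrence intuition concrete, here is a deliberately simplified toy score. It is not Gensim's 'c_v' measure (which is considerably more involved); the function name toy_cooccurrence_score and the sample documents are made up for illustration. It just averages, over every pair of topic words, the share of documents that contain both words.

```python
from itertools import combinations


def toy_cooccurrence_score(topic_words, documents):
    """Mean share of documents in which each pair of topic words appears together."""
    doc_sets = [set(doc) for doc in documents]
    pair_rates = []
    for w1, w2 in combinations(topic_words, 2):
        together = sum(1 for doc in doc_sets if w1 in doc and w2 in doc)
        pair_rates.append(together / len(doc_sets))
    return sum(pair_rates) / len(pair_rates)


# Hypothetical tokenised documents
documents = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["market", "trade", "bank"],
]

coherent_topic = ["dog", "pet", "vet"]   # words that tend to share documents
mixed_topic = ["dog", "market", "bank"]  # words drawn from unrelated documents

coherent_score = toy_cooccurrence_score(coherent_topic, documents)
mixed_score = toy_cooccurrence_score(mixed_topic, documents)
print(coherent_score, mixed_score)
```

The topic whose top words actually appear together scores higher, which is the behaviour a real coherence measure is meant to capture.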
Start by creating dictionaries of models and topic words for the various topic numbers you want to consider, where in this case corpus is the cleaned tokens (one list of tokens per document), num_topics is a list of the topic numbers you want to consider, and num_keywords is the number of top words per topic that you want to be considered for the metrics:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

# Map each token to an integer id, then convert every document to bag-of-words
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Considering 1-15 topics; the last is cut off from the metrics
# because it has no next model to compare against
num_topics = list(range(1, 16))
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=i,
                                             num_words=num_keywords,
                                             formatted=False)
    # Keep only the words of each topic, dropping their probabilities
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]
Now create a function to derive the Jaccard similarity of two topics:
def jaccard_similarity(topic_1, topic_2):
    """
    Derives the Jaccard similarity of two topics

    Jaccard similarity:
    - A statistic used for comparing the similarity and diversity of sample sets
    - J(A,B) = |A ∩ B| / |A ∪ B|
    - Goal is low Jaccard scores for coverage of the diverse elements
    """
    intersection = set(topic_1).intersection(set(topic_2))
    union = set(topic_1).union(set(topic_2))

    return float(len(intersection)) / float(len(union))
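As a quick sanity check of the formula, here is the same calculation on two made-up keyword lists (the words are hypothetical and the function is repeated so the snippet runs on its own). Two of the six distinct words are shared, so the score is 2/6.

```python
def jaccard_similarity(topic_1, topic_2):
    """Size of the intersection divided by the size of the union."""
    intersection = set(topic_1).intersection(set(topic_2))
    union = set(topic_1).union(set(topic_2))
    return len(intersection) / len(union)


# Hypothetical top words of two topics
topic_a = ["cat", "dog", "fish", "bird"]
topic_b = ["dog", "bird", "horse", "cow"]

# Shared words: dog, bird (2). Distinct words overall: 6.
overlap = jaccard_similarity(topic_a, topic_b)
print(overlap)

# Identical topics give 1.0 and disjoint topics give 0.0
identical = jaccard_similarity(topic_a, topic_a)
disjoint = jaccard_similarity(["cat", "dog"], ["stock", "bank"])
```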
Use the above to derive the mean stability for each number of topics by comparing its topics with those of the model with the next number of topics:
LDA_stability = {}
for i in range(0, len(num_topics) - 1):
    jaccard_sims = []
    for topic1 in LDA_topics[num_topics[i]]:
        sims = []
        for topic2 in LDA_topics[num_topics[i + 1]]:
            sims.append(jaccard_similarity(topic1, topic2))

        jaccard_sims.append(sims)

    LDA_stability[num_topics[i]] = jaccard_sims

mean_stabilities = [np.array(LDA_stability[i]).mean() for i in num_topics[:-1]]
gensim has a built-in model for topic coherence (this uses the 'c_v' option):
coherences = [CoherenceModel(model=LDA_models[i], texts=corpus, dictionary=dirichlet_dict, coherence='c_v').get_coherence()
              for i in num_topics[:-1]]
From here derive the ideal number of topics roughly through the difference between the coherence and stability per number of topics:
# Pair each coherence with the stability for the same number of topics
coh_sta_diffs = [coh - sta for coh, sta in zip(coherences, mean_stabilities)]
coh_sta_max = max(coh_sta_diffs)
coh_sta_max_idxs = [i for i, j in enumerate(coh_sta_diffs) if j == coh_sta_max]
ideal_topic_num_index = coh_sta_max_idxs[0] # choose fewer topics in case there's more than one max
ideal_topic_num = num_topics[ideal_topic_num_index]
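To see the selection rule in isolation, here is a small sketch with made-up numbers (the candidate topic counts and scores below are hypothetical, not the output of any model). Two candidates tie for the largest coherence minus overlap, and the smaller one is picked.

```python
# Hypothetical scores for four candidate numbers of topics
candidate_topic_nums = [2, 3, 4, 5]
toy_coherences = [0.25, 0.5, 0.625, 0.5]
toy_stabilities = [0.125, 0.125, 0.25, 0.25]

# Higher coherence is better and lower overlap is better,
# so score each candidate by coherence minus overlap
toy_diffs = [coh - sta for coh, sta in zip(toy_coherences, toy_stabilities)]

# list.index returns the first position of the maximum,
# so ties resolve to the smaller number of topics
best_index = toy_diffs.index(max(toy_diffs))
toy_ideal_topic_num = candidate_topic_nums[best_index]
print(toy_diffs, toy_ideal_topic_num)
```

With real floating-point scores exact ties are unlikely, so in practice it is probably worth looking at the neighbouring candidates as well rather than trusting a single maximum.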
Finally graph these metrics across the topic numbers:
plt.figure(figsize=(20,10))
ax = sns.lineplot(x=num_topics[:-1], y=mean_stabilities, label='Average Topic Overlap')
ax = sns.lineplot(x=num_topics[:-1], y=coherences, label='Topic Coherence')
ax.axvline(x=ideal_topic_num, label='Ideal Number of Topics', color='black')
ax.axvspan(xmin=ideal_topic_num - 1, xmax=ideal_topic_num + 1, alpha=0.5, facecolor='grey')
y_max = 1.10 * max(max(mean_stabilities), max(coherences))  # leave 10% headroom above the highest value
ax.set_ylim([0, y_max])
ax.set_xlim([1, num_topics[-1]-1])
ax.axes.set_title('Model Metrics per Number of Topics', fontsize=25)
ax.set_ylabel('Metric Level', fontsize=20)
ax.set_xlabel('Number of Topics', fontsize=20)
plt.legend(fontsize=20)
plt.show()
Your ideal number of topics will maximize coherence and minimize the topic overlap based on Jaccard similarity. In this case it looks like we'd be safe choosing topic numbers around 14.