how to determine the number of topics for LDA?

给你一囗甜甜゛ 提交于 2019-12-03 11:50:56

问题


I am a freshman in LDA and I want to use it in my work. However, some problems appear.

In order to get the best performance, I want to estimate the best topic number. After reading "Finding Scientific topics", I know that I can calculate logP(w|z) firstly and then use the harmonic mean of a series of P(w|z) to estimate P(w|T).

My question is what does the "a series of" mean?


回答1:


Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, hierarchical dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.

If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.




回答2:


First some people use harmonic mean for finding optimal no.of topics and i also tried but results are unsatisfactory.So as per my suggestion ,if you are using R ,then package"ldatuning" will be useful.It has four metrics for calculating optimal no.of parameters. Again perplexity and log-likelihood based V-fold cross validation are also very good option for best topic modeling.V-Fold cross validation are bit time consuming for large dataset.You can see "A heuristic approach to determine an appropriate no.of topics in topic modeling". Important links: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/




回答3:


A reliable way is to compute the topic coherence for different number of topics and choose the model that gives the highest topic coherence. But sometimes, the highest may not always fit the bill.

See this topic modeling example.




回答4:


Let k = number of topics

There is no single best way and I am not even sure if there is any standard practices for this.

Method 1: Try out different values of k, select the one that has the largest likelihood.

Method 2: Instead of LDA, see if you can use HDP-LDA

Method 3: If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.



来源:https://stackoverflow.com/questions/17421887/how-to-determine-the-number-of-topics-for-lda

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!