Question:

I have been studying LDA and topic models for several weeks, but because of my weak mathematics background I cannot fully understand the underlying algorithms. I have used the GibbsLDA implementation: I fed it a large set of documents, set the number of topics to 100, and got a file named "final.theta" that stores the proportion of each topic in each document. This result is good, and I can use the topic proportions for many other tasks. But when I tried Blei's C implementation of LDA, I only got a file named "final.gamma", and I don't know how to transform this file into topic proportions. Can anyone help me? I have also learned that LDA has many improved variants (such as CTM and HLDA). Is there a topic model similar to LDA that, given a set of documents, directly outputs the per-document topic proportions? Thank you very much!
Answer 1:
I think the problem with the Blei implementation is that you're doing variational inference by running:
$ lda inf [args...]
when what you want is topic estimation:
$ lda est [args...]
Once this runs, there will be a file "final.beta" in either the current directory or the directory specified by the optional last argument. Then you run the python script "topics.py", included in the tar. The readme here: http://www.cs.princeton.edu/~blei/lda-c/readme.txt describes it all, especially sections B and D.
(If this still doesn't make sense, let me know)
As for improvements such as CTM: I don't know anything about HLDA, but I have used both LDA and CTM in the past, and I can say that neither is strictly better than the other; each is better suited to different data. CTM makes the assumption that topics are correlated, and uses that assumption to improve the results as long as it holds.
Hope this helps!
Answer 2:
To get E[θ], just normalize the gamma values within each row of final.gamma (each row corresponds to one document). This follows from the properties of the Dirichlet distribution: the mean of a Dirichlet-distributed vector with parameters γ is γ_k / Σ_j γ_j.
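This normalization can be sketched in Python with NumPy. The helper name `gamma_to_theta` is hypothetical; lda-c's final.gamma is a whitespace-separated file with one row of variational Dirichlet parameters per document, so in practice you would pass it through `np.loadtxt` first:

```python
import numpy as np

def gamma_to_theta(gamma):
    """Row-normalize variational Dirichlet parameters (one row per
    document) to get E[theta], the expected topic proportions."""
    gamma = np.asarray(gamma, dtype=float)
    # Mean of a Dirichlet: E[theta_k] = gamma_k / sum_j gamma_j
    return gamma / gamma.sum(axis=1, keepdims=True)

# With the real lda-c output you would do, e.g.:
#   theta = gamma_to_theta(np.loadtxt("final.gamma"))

# Tiny illustration: 2 documents, 3 topics.
theta = gamma_to_theta([[2.0, 1.0, 1.0],
                        [1.0, 1.0, 2.0]])
# Each row of theta now sums to 1 and plays the same role as a row
# of GibbsLDA's final.theta.
```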