How to understand the output of Topic Model class in Mallet?

难免孤独 2021-02-03 11:49

As I'm trying out the example code in the topic modeling developer's guide, I'd really like to understand the meaning of that code's output.

First during the running p

3 Answers
  • 2021-02-03 12:17

    I don't think I know enough to give a very complete answer, but here's a shot at some of it. For Q1, you can inspect the source code to see how those values are calculated. For Q2, LL is the model's log-likelihood divided by the total number of tokens; it measures how likely the data are given the model, and increasing values mean the model is improving. These are also available in the R packages for topic modeling. Q2, yes, I think that's right for the first line. Q3, good question; it's not immediately clear to me. Perhaps the (x) are some kind of index, as token frequency seems unlikely. Presumably most of these are diagnostics of some kind.
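To make the per-token normalization concrete, here is a minimal sketch of the arithmetic described above. The function name and the numbers are made up for illustration; they are not taken from MALLET or from any real run.

```python
def ll_per_token(log_likelihood: float, num_tokens: int) -> float:
    """Normalize a model log-likelihood by the number of tokens in the corpus."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return log_likelihood / num_tokens


# Hypothetical values: the same corpus size at two points during training.
early = ll_per_token(-900_000.0, 100_000)
later = ll_per_token(-850_000.0, 100_000)

# A value closer to zero (less negative) means the data are more likely
# under the model, which is the "improving" direction.
print(early, later, later > early)
```

Dividing by the token count is what makes the number comparable across corpora of different sizes, since the raw log-likelihood grows in magnitude with every extra token.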

    A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml which will produce a large number of measures of topic quality. They're definitely worth checking out.

    For the full story about all of this I'd suggest writing an email to David Mimno at Princeton who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel and then posting answers back here for those of us curious about the inner workings of MALLET...

  • 2021-02-03 12:27

    For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over topic distributions for documents. Mallet optimises this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for a small proportion of your corpus.

    The token counts represent only the words with highest counts. The remaining counts for topic 0 could, for example, be 0, and the remaining counts for topic 9 could be 3. Thus topic 9 can account for many more words in your corpus than topic 0, even though the counts for the top words are lower.
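A small sketch of that argument, using invented counts: topic 0 has the larger top-word counts and nothing else, while topic 9 has smaller top-word counts but a long tail of words that each occur 3 times. The 50-word tail is an assumption chosen only to make the effect visible.

```python
def top_and_total(counts, k=5):
    """Return (sum of the k largest counts, sum of all counts) for one topic."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]), sum(ordered)


# Hypothetical per-word token counts for two topics.
topic0 = [8, 7, 6, 4, 4] + [0] * 50   # strong top words, empty tail
topic9 = [5, 4, 4, 3, 3] + [3] * 50   # weaker top words, long tail of 3s

top0, total0 = top_and_total(topic0)
top9, total9 = top_and_total(topic9)

print(top0, total0)  # top-5 mass and total mass for topic 0
print(top9, total9)  # top-5 mass and total mass for topic 9
```

With these numbers, topic 0 wins on the displayed top words while topic 9 accounts for far more tokens overall, which is why the printed counts alone say little about a topic's share of the corpus.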

    I'd have to check out the code for the "0 0.55" at the end, but that's probably the optimised \beta value (which I'm pretty sure isn't done asymmetrically).

  • 2021-02-03 12:31

    What I understand is this:

    0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   
    
    • 0 is the topic number.
    • 0.008 is the weight of that topic.
    • battle (8) union (7) [...] are the top keywords in that topic. The numbers are the occurrences of each word in the topic.
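If that reading is right, a line like the one above can be split into its three parts with a few lines of Python. This is only a sketch that assumes the layout shown here (topic number, weight, then `word (count)` pairs); `parse_topic_line` is a hypothetical helper, not part of MALLET.

```python
import re


def parse_topic_line(line):
    """Split 'id weight word (n) word (n) ...' into (id, weight, [(word, n), ...])."""
    parts = line.split(None, 2)
    topic_id = int(parts[0])
    weight = float(parts[1])
    rest = parts[2] if len(parts) > 2 else ""
    # Each keyword is followed by its count in parentheses.
    words = [(w, int(c)) for w, c in re.findall(r"(\S+) \((\d+)\)", rest)]
    return topic_id, weight, words


line = "0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)"
topic_id, weight, words = parse_topic_line(line)
print(topic_id, weight, words)
```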

    Then, as a result, you also obtain a .csv file. I think it contains the most important data of the process. You will find values like the following in each row:

    0   0   285 10   page make items thing work put dec browsers recipes expressions 
    

    That is:

    1. Tree level
    2. Topic ID
    3. Total words
    4. Total documents
    5. Top-10 words
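Taking the five fields listed above at face value, a row can be read into named fields as follows. The field names are my own labels for this interpretation, and the sketch assumes whitespace-separated columns as in the sample row; the real file may use a different delimiter.

```python
def parse_row(row):
    """Read one whitespace-separated row into the five fields listed above."""
    fields = row.split()
    return {
        "tree_level": int(fields[0]),
        "topic_id": int(fields[1]),
        "total_words": int(fields[2]),
        "total_documents": int(fields[3]),
        "top_words": fields[4:],  # everything after the four numeric columns
    }


row = "0   0   285 10   page make items thing work put dec browsers recipes expressions"
parsed = parse_row(row)
print(parsed)
```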

    A little bit late, but I hope it helps someone.
