LDA with topicmodels, how can I see which topics different documents belong to?

迷失自我 2020-11-30 02:28

I am using LDA from the topicmodels package. I ran it on about 30,000 documents, extracted 30 topics, and got the top 10 words for each topic; they look very good. B

3 answers
  • 2020-11-30 02:29

    To see the most likely topic for each document, use topics(). In the output, the upper row of each pair is the document and the lower row is its topic number:

    topics(lda)
     1     2     3     4     5     6     7     8     9    10    11    12 
     60    41    64    19    94    93    12    64    12    33    59    28 
     13    14    15    16    17    18    19    20    21    22    23    24 
     87    19    98    69    61    18    27    18    87    96    44    65 
     25    26    27    28    29    30    31    32    33    34    35    36 
     98    77    19    56    76    51    47    38    55    38    92    96 
     37    38    39    40    41    42    43    44    45    46    47    48 
     19    19    19    38    79    21    17    21    59    24    49     2 
     49    50    51    52    53    54    55    56    57    58    59    60 
     66    65    41    36    68    19    70    50    54    37    27    77 
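
    A minimal sketch of two common follow-ups, assuming the AssociatedPress data that ships with topicmodels stands in for your own document-term matrix (the 20-document subset, k = 5 and the seed are arbitrary choices for illustration): counting the documents per topic and listing the documents assigned to one topic.

    ```r
    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    # Small illustrative model; subset size, k and seed are arbitrary assumptions
    lda <- LDA(AssociatedPress[1:20, ], k = 5, control = list(seed = 1234))

    doc_topics <- topics(lda)   # most likely topic for each document
    table(doc_topics)           # how many documents fall under each topic
    which(doc_topics == 1)      # positions of the documents assigned to topic 1
    topics(lda, 2)              # the two most likely topics for each document
    ```

    With 30,000 documents, table() is usually the quickest way to spot topics that attract very few or very many documents.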
    

    To see the top term of each topic, use:

    terms(lda)
    Topic 1      Topic 2      Topic 3      Topic 4      Topic 5 
     "quite"       "food"       "lots"       "come"       "like" 
     Topic 6      Topic 7      Topic 8      Topic 9     Topic 10 
      "ever"     "around"        "bar"      "loved"        "new" 
    

    I hope this answers your question!

    External read that may help: http://www.rtexttools.com/1/post/2011/08/getting-started-with-latent-dirichlet-allocation-using-rtexttools-topicmodels.html

    Rachel Shuyan Wang

  • 2020-11-30 02:50

    Here is an example using the AssociatedPress dataset that ships with topicmodels. It shows which topic each document belongs to with the highest probability.

    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")
    
    k <- 5 # set number of topics
    # generate model
    lda <- LDA(AssociatedPress[1:20, ], k = k, control = list(alpha = 0.1))
    # now we have a topic model with 20 docs and five topics
    
    # make a data frame with topics as cols, docs as rows and
    # cell values as posterior topic distribution for each document
    gammaDF <- as.data.frame(lda@gamma) 
    names(gammaDF) <- c(1:k)
    # inspect...
    gammaDF
                  1            2            3            4            5
    1  8.979807e-05 8.979807e-05 9.996408e-01 8.979807e-05 8.979807e-05
    2  8.714836e-05 8.714836e-05 8.714836e-05 8.714836e-05 9.996514e-01
    3  9.261396e-05 9.996295e-01 9.261396e-05 9.261396e-05 9.261396e-05
    4  9.995437e-01 1.140774e-04 1.140774e-04 1.140774e-04 1.140774e-04
    5  3.573528e-04 3.573528e-04 9.985706e-01 3.573528e-04 3.573528e-04
    6  5.610659e-05 5.610659e-05 5.610659e-05 5.610659e-05 9.997756e-01
    7  9.994345e-01 1.413820e-04 1.413820e-04 1.413820e-04 1.413820e-04
    8  4.286702e-04 4.286702e-04 4.286702e-04 9.982853e-01 4.286702e-04
    9  3.319338e-03 3.319338e-03 9.867226e-01 3.319338e-03 3.319338e-03
    10 2.034781e-04 2.034781e-04 9.991861e-01 2.034781e-04 2.034781e-04
    11 4.810342e-04 9.980759e-01 4.810342e-04 4.810342e-04 4.810342e-04
    12 2.651256e-04 9.989395e-01 2.651256e-04 2.651256e-04 2.651256e-04
    13 1.430945e-04 1.430945e-04 1.430945e-04 9.994276e-01 1.430945e-04
    14 8.402940e-04 8.402940e-04 8.402940e-04 9.966388e-01 8.402940e-04
    15 8.404830e-05 9.996638e-01 8.404830e-05 8.404830e-05 8.404830e-05
    16 1.903630e-04 9.992385e-01 1.903630e-04 1.903630e-04 1.903630e-04
    17 1.297372e-04 1.297372e-04 9.994811e-01 1.297372e-04 1.297372e-04
    18 6.906241e-05 6.906241e-05 6.906241e-05 9.997238e-01 6.906241e-05
    19 1.242780e-04 1.242780e-04 1.242780e-04 1.242780e-04 9.995029e-01
    20 9.997361e-01 6.597684e-05 6.597684e-05 6.597684e-05 6.597684e-05
    
    
    # Now for each doc, find just the top-ranked topic
    # (which.max returns a single index per row, even if probabilities tie)
    toptopics <- as.data.frame(cbind(document = row.names(gammaDF), 
      topic = apply(gammaDF, 1, function(x) names(gammaDF)[which.max(x)])))
    # inspect...
    toptopics   
           document topic
    1         1     3
    2         2     5
    3         3     2
    4         4     1
    5         5     3
    6         6     5
    7         7     1
    8         8     4
    9         9     3
    10       10     3
    11       11     2
    12       12     2
    13       13     4
    14       14     4
    15       15     2
    16       16     2
    17       17     3
    18       18     4
    19       19     5
    20       20     1
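
    As a cross-check, here is a hedged sketch that reads the same probabilities through posterior() instead of the @gamma slot and compares the manual result with topics(). The seed is an arbitrary assumption added only to make the run repeatable.

    ```r
    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    # Same kind of small model as above; the seed is an arbitrary assumption
    lda <- LDA(AssociatedPress[1:20, ], k = 5,
               control = list(alpha = 0.1, seed = 1234))

    # documents x topics matrix of posterior topic probabilities
    gamma <- posterior(lda)$topics

    # index of the largest probability in each row
    top_from_gamma <- apply(gamma, 1, which.max)

    # TRUE when the manual argmax agrees with topics()
    all(top_from_gamma == as.integer(topics(lda)))
    ```

    If all you need is the single best topic per document, topics() saves building the data frame by hand; the gamma matrix is worth keeping when you also want the probabilities.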
    

    Is that what you want to do?

    Hat-tip to this answer: https://stat.ethz.ch/pipermail/r-help/2010-August/247706.html

  • 2020-11-30 02:50

    library(topicmodels)

    # assumes dtm is your DocumentTermMatrix and k is the number of topics (e.g. k <- 5)
    ldaGibbs5 <- LDA(dtm, k, method = "Gibbs")

    # get the most likely topic for each document
    # (paste() joins its arguments with spaces, so the file names contain spaces)
    ldaGibbs5.topics <- as.matrix(topics(ldaGibbs5))
    write.csv(ldaGibbs5.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))

    # get top 10 terms in each topic
    ldaGibbs5.terms <- as.matrix(terms(ldaGibbs5, 10))
    write.csv(ldaGibbs5.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))

    # get probability of each topic in each doc
    topicProbabilities <- as.data.frame(ldaGibbs5@gamma)
    write.csv(topicProbabilities, file = paste("LDAGibbs", k, "TopicProbabilities.csv"))
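
    If one file per document is more convenient than three separate ones, the pieces above can be combined. This is only a sketch: the AssociatedPress subset stands in for dtm, and the seed, the iteration count, the column names and the output file name are all assumptions made for illustration.

    ```r
    library(topicmodels)
    data("AssociatedPress", package = "topicmodels")

    dtm <- AssociatedPress[1:20, ]   # stand-in for your own DocumentTermMatrix
    k <- 5
    fit <- LDA(dtm, k = k, method = "Gibbs",
               control = list(seed = 1234, iter = 200))

    gamma <- posterior(fit)$topics   # documents x topics probabilities

    # one row per document: its position, top topic, and that topic's probability
    assignments <- data.frame(
      document    = seq_len(nrow(gamma)),
      top_topic   = max.col(gamma, ties.method = "first"),
      probability = apply(gamma, 1, max)
    )
    head(assignments)

    # written to a temporary directory here; point this at your own folder
    out_file <- file.path(tempdir(), paste0("LDAGibbs", k, "_assignments.csv"))
    write.csv(assignments, out_file, row.names = FALSE)
    ```

    Keeping the probability next to the topic makes it easy to filter out documents whose best topic is only weakly supported.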
    