R : Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix in R

问题

I have a corpus of 39 text files named by the year - 1945.txt, 1978.txt.... 2013.txt.

I've imported them into R and created a Document Term Matrix using TM package. I'm trying to investigate how words associated with term'fraud' have changed over years from 1945 to 2013. The desired output would be a 39 by 10/5 matrix with years as row titles and top 10 or 5 terms as columns.

Any help would be greatly appreciated.

Thanks in advance.

Structure of my TDM:

> str(ytdm)
List of 6
 $ i       : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
 $ j       : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
 $ nrow    : int 193
 $ ncol    : int 39
 $ dimnames:List of 2
  ..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
  ..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

My ideal output is like this:


1947   account accur gao medicine fed ......
1948   access  .............
.
.
.
.
.
.

回答1:

Your example can't be replicated but findAssocs() is probably what you're looking for. Since you want to only look at associates on a yearly basis you'll need a dtm for each year.

> library(tm)
> data(crude)
> # i don't have your data so pretend this is corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel. 
   0.79    0.70 

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed 
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85 
     late   meeting     never      that    trying       who    winter emergency     above       but 
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83 
    world      they       mln    market agreement    before       bpd    buyers    energy    prices 
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75 
      set   through     under      will       not       its 
     0.75      0.75      0.75      0.74      0.72      0.70 

> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices"   "barrel."  "said."    "minister" "arabian" 

$`2000`
[1] "15.8"    "opec"    "and"     "said"    "prices,"

来源：https://stackoverflow.com/questions/16695866/r-finding-the-top-10-terms-associated-with-the-term-fraud-across-documents-i

标签

word-frequency

term-document-matrix