Question
I have a corpus of 39 text files named by year: 1945.txt, 1978.txt, ..., 2013.txt.
I've imported them into R and created a term-document matrix using the tm package. I'm trying to investigate how the words associated with the term 'fraud' have changed over the years from 1945 to 2013. The desired output would be a 39-by-10 (or 39-by-5) matrix with years as row names and the top 10 or 5 associated terms as columns.
Any help would be greatly appreciated.
Thanks in advance.
Structure of my TDM:
> str(ytdm)
List of 6
$ i : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
$ j : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
$ nrow : int 193
$ ncol : int 39
$ dimnames:List of 2
..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
My ideal output is like this:
1947 account accur gao medicine fed ......
1948 access .............
...
Answer 1:
Your example can't be reproduced, but findAssocs() is probably what you're looking for. Since you only want to look at associations on a yearly basis, you'll need a separate term-document matrix for each year.
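For context, the scores that findAssocs() reports are correlations between the target term's counts and every other term's counts across the documents in the matrix. The following is a minimal base R sketch of that idea, not tm's actual implementation; the toy matrix `m` and its term and document names are made up for illustration.

```r
# Toy term-document counts (hypothetical data): rows are terms, columns are documents
m <- matrix(c(2, 0, 4, 1,
              1, 0, 2, 1,
              0, 3, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("fraud", "audit", "grant"),
                            Docs  = paste0("doc", 1:4)))

target <- "fraud"

# Correlate the target term's counts with every term's counts across documents
scores <- cor(t(m))[target, ]

# Drop the target itself and rank the remaining terms by correlation
scores <- sort(scores[names(scores) != target], decreasing = TRUE)
round(scores, 2)
```

Because the score is a correlation across documents, each per-year matrix needs more than one document for the result to be defined.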
> library(tm)
> data(crude)
> # I don't have your data, so pretend this is a corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a term-document matrix for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
prices barrel.
0.79 0.70
$`2000`
15.8 opec and said prices, sell the analysts clearly fixed
0.94 0.94 0.92 0.92 0.91 0.91 0.88 0.85 0.85 0.85
late meeting never that trying who winter emergency above but
0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.84 0.83 0.83
world they mln market agreement before bpd buyers energy prices
0.82 0.80 0.79 0.78 0.75 0.75 0.75 0.75 0.75 0.75
set through under will not its
0.75 0.75 0.75 0.74 0.72 0.70
> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices" "barrel." "said." "minister" "arabian"
$`2000`
[1] "15.8" "opec" "and" "said" "prices,"
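To get from a per-year list like this to the years-by-terms matrix described in the question, the list can be bound into a character matrix. This is a minimal sketch in base R: the `assoc.list` below is a hand-typed stand-in for the list printed above, and `top.n` and the `term1`...`term5` column names are hypothetical choices. Depending on the tm version, findAssocs() may return a list keyed by term rather than a named vector, in which case the names would have to be taken from that list element first.

```r
# Stand-in for the per-year list of top terms (copied from the output above)
assoc.list <- list(
  `1999` = c("prices", "barrel.", "said.", "minister", "arabian"),
  `2000` = c("15.8", "opec", "and", "said", "prices,")
)

top.n <- 5  # number of terms to keep per year (assumed)

# Take the first top.n terms per year; years with fewer terms are padded with NA.
# sapply() returns one column per year, so transpose to get years as rows.
assoc.mat <- t(sapply(assoc.list, function(x) x[seq_len(top.n)]))
colnames(assoc.mat) <- paste0("term", seq_len(top.n))
assoc.mat
```

With 39 years in the list, the same call would give a 39-by-5 matrix with the years as row names.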
Source: https://stackoverflow.com/questions/16695866/r-finding-the-top-10-terms-associated-with-the-term-fraud-across-documents-i