Question
I want to create a document term matrix using native R (without additional packages such as tm). The data is structured as follows:
Doc1: the test was to test the test
Doc2: we did prepare the exam to test the exam
Doc3: was the test the exam
Doc4: the exam we did prepare was to test the test
Doc5: we were successful so we all passed the exam
What I want to achieve is the following:
  Term   Doc1 Doc2 Doc3 Doc4 Doc5 DF
1 all       0    0    0    0    1  1
2 did       0    1    0    1    0  2
3 exam      0    2    1    1    1  4
4 passed    0    0    0    0    1  1
Answer 1:
Here's an approach, but again, why not use the tm package?
## Your data
dat <- data.frame(
  doc = factor(paste0("Doc", 1:5)),
  text = c("the test was to test the test",
           "we did prepare the exam to test the exam",
           "was the test the exam",
           "the exam we did prepare was to test the test",
           "we were successful so we all passed the exam"),
  stringsAsFactors = FALSE
)
## Function to turn a list of vectors into a data frame of counts
## (one row per list element, one column per unique value)
mtabulate <- function(vects) {
  lev <- sort(unique(unlist(vects)))
  dat <- do.call(rbind, lapply(vects, function(x, lev) {
    tabulate(factor(x, levels = lev), nbins = length(lev))
  }, lev = lev))
  colnames(dat) <- lev
  data.frame(dat, check.names = FALSE)
}
out <- lapply(split(dat$text, dat$doc), function(x) {
  unlist(strsplit(tolower(x), " "))
})
t(mtabulate(out))
##            Doc1 Doc2 Doc3 Doc4 Doc5
## all           0    0    0    0    1
## did           0    1    0    1    0
## exam          0    2    1    1    1
## passed        0    0    0    0    1
## prepare       0    1    0    1    0
## so            0    0    0    0    1
## successful    0    0    0    0    1
## test          3    1    1    2    0
## the           2    2    2    2    1
## to            1    1    0    1    0
## was           1    0    1    1    0
## we            0    1    0    1    2
## were          0    0    0    0    1
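The question also asks for a Term column and a DF (document frequency) column, which the matrix above does not include. Here is a minimal base R sketch of one way to add them. It is self-contained and rebuilds the counts from scratch; the names docs, tokens, vocab, tdm and result are chosen here for illustration and are not part of the original answer.

```r
# The five documents from the question, as a named character vector.
docs <- c(
  Doc1 = "the test was to test the test",
  Doc2 = "we did prepare the exam to test the exam",
  Doc3 = "was the test the exam",
  Doc4 = "the exam we did prepare was to test the test",
  Doc5 = "we were successful so we all passed the exam"
)

# Lower-case and split on whitespace; the result keeps the document names.
tokens <- strsplit(tolower(docs), "\\s+")

# Shared vocabulary across all documents, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: rows are terms, columns are documents.
tdm <- vapply(
  tokens,
  function(x) tabulate(factor(x, levels = vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(tdm) <- vocab

# DF = number of documents in which a term occurs at least once.
result <- data.frame(
  Term = vocab,
  tdm,
  DF = rowSums(tdm > 0),
  row.names = NULL,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

print(result)
```

With this input, the row for "exam" has DF equal to 4, matching the table in the question. Splitting on "\\s+" rather than a single space is a small design choice that tolerates repeated spaces between words. Punctuation is not handled.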
Source: https://stackoverflow.com/questions/19593885/how-to-create-a-document-term-matrix-using-native-r