I have a dataset of 200+ PDFs that I converted into a corpus. I'm using the tm package for R for text pre-processing and mining. So far, I've successfully created the DTM (
You can pass the dictionary option in the control list when you create your DocumentTermMatrix, so that only the terms in the dictionary are counted. The example code below shows how it works. Once the result is in DocumentTermMatrix or data.frame form, you can use aggregation functions if you don't need the word counts per document.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
my_words <- c("oil", "corporation")
dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))
# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)
  docs corporation oil
1  127           0   5
2  144           0  11
3  191           0   2
4  194           0   1
5  211           0   1
6  236           0   7
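If you don't need the counts per document, the aggregation step mentioned above could look something like the sketch below. It rebuilds the same dtm as the example and sums over documents with base R's colSums. The names total_counts and doc_counts are just illustrative, and converting with as.matrix assumes the matrix is small enough to fit in memory as a dense matrix.

```r
library(tm)

# Rebuild the dictionary-restricted DTM from the example above
data("crude")
crude <- tm_map(as.VCorpus(crude), content_transformer(tolower))
my_words <- c("oil", "corporation")
dtm <- DocumentTermMatrix(crude, control = list(dictionary = my_words))

# Dense matrix: one row per document, one column per dictionary term
m <- as.matrix(dtm)

# Total number of occurrences of each word across the whole corpus
# (illustrative name, not part of tm)
total_counts <- colSums(m)

# Number of documents in which each word appears at least once
# (illustrative name, not part of tm)
doc_counts <- colSums(m > 0)

total_counts
doc_counts
```

With only a handful of dictionary words the dense conversion should be cheap even for a few hundred documents, since the matrix has just one column per word.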