I have a project that requires me to search the annual reports of various companies for key phrases. I have converted the reports to text files, created and cleaned a corpus, and then created a document-term matrix. The tm_term_score function only seems to work for single words, not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent ones)?
For example, I want to see how many times the phrase “supply chain finance” appears in each document in the corpus. However, when I run the code using tm_term_score, it reports that no documents contain the phrase, when in fact they do.
My progress so far looks as follows:
library(tm)
setwd("C:/Users/Desktop/Annual Reports")
dest <- "C:/Users/Desktop/Annual Reports"
a <- Corpus(DirSource("C:/Users/Desktop/Annual Reports"), readerControl = list(language = "lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
tokenizing.phrases <- c("supply growth", "import revenues", "financing projects")
I am quite new to R and cannot decipher how to search my corpus for these key phrases.
Perhaps something like the following will help you.
First, create an object with your key phrases, such as
tokenizing.phrases <- c("general counsel", "chief legal officer", "inside counsel", "in-house counsel",
"law department", "law dept", "legal department", "legal function",
"law firm", "law firms", "external counsel", "outside counsel",
"law suit", "law suits", # can be hyphenated, eg.
"accounts payable", "matter management")
Then use this function (perhaps with tweaks for your needs).
library(stringr) # str_trim, str_detect, str_split, regex
library(tm) # MC_tokenizer

phraseTokenizer <- function(x) {
  x <- as.character(x) # extract the plain text from the tm TextDocument object
  if (length(x) == 0 || all(is.na(x))) return("")
  x <- str_trim(paste(x, collapse = " ")) # join multi-line documents into one string
  #warning(paste("doing:", x))
  # regex(..., ignore_case = TRUE) is used in place of the deprecated ignore.case() helper
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))
  if (any(phrase.hits)) {
    # only split once on the first hit, so not to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2])) # this is recursive, since f() calls itself
  } else {
    out <- MC_tokenizer(x)
  }
  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation
  out[out != ""]
}
Then create your term document matrix with the phrases included in it.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
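If all you need is the number of times each phrase occurs in each document, counting the matches directly may be simpler than a custom tokenizer. The following is only a minimal sketch in base R, not part of the approach above: the helper name count_phrases and the sample texts in docs are hypothetical, and it assumes each document is available as a single character string.

```r
# Hypothetical stand-ins for the cleaned report texts, one string per document.
docs <- c(
  report_a = "Supply chain finance grew. We expanded supply chain finance abroad.",
  report_b = "Import revenues fell while financing projects continued.",
  report_c = "No relevant phrases here."
)
phrases <- c("supply chain finance", "import revenues", "financing projects")

# Count non-overlapping, case-insensitive occurrences of each phrase per document.
# Phrases are matched as literal text (fixed = TRUE), not as regular expressions.
count_phrases <- function(texts, phrases) {
  texts_lower <- tolower(texts)
  counts <- sapply(tolower(phrases), function(p) {
    lengths(regmatches(texts_lower, gregexpr(p, texts_lower, fixed = TRUE)))
  })
  # Force a documents-by-phrases matrix even when there is only one document.
  matrix(counts, nrow = length(texts),
         dimnames = list(names(texts), phrases))
}

phrase_counts <- count_phrases(docs, phrases)
print(phrase_counts)
```

With a tm corpus, the input strings could probably be built with something like sapply(a, function(d) paste(as.character(d), collapse = " ")), but that line is an untested assumption about how your corpus is stored.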