Question
I have a project requiring me to search annual reports of various companies and find key phrases in them. I have converted the reports to text files, created and cleaned a corpus. I then created a document term matrix. The tm_term_score function only seems to work for single words and not phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent)?
For example:
I want to see how many times the phrase "supply chain finance" appears in each document in the corpus. However, when I run the code using tm_term_score, it reports that no documents contained the phrase, when in fact they did.
My progress so far looks as follows:
library(tm)
library(stringr)

setwd("C:/Users/Desktop/Annual Reports")
dest <- "C:/Users/Desktop/Annual Reports"

a <- Corpus(DirSource("C:/Users/Desktop/Annual Reports"), readerControl = list(language = "lat"))
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)

tokenizing.phrases <- c("supply growth", "import revenues", "financing projects")
I am quite new to R and cannot decipher how to search my corpus for these key phrases.
Answer 1:
Perhaps something like the following will help you.
First, create an object with your key phrases, such as
tokenizing.phrases <- c("general counsel", "chief legal officer", "inside counsel", "in-house counsel",
                        "law department", "law dept", "legal department", "legal function",
                        "law firm", "law firms", "external counsel", "outside counsel",
                        "law suit", "law suits",  # these may also appear hyphenated, e.g. "law-suits"
                        "accounts payable", "matter management")
Then use this function (perhaps with tweaks for your needs).
phraseTokenizer <- function(x) {
  require(stringr)

  x <- as.character(x)  # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")

  # stringr's old ignore.case() helper is defunct; regex(..., ignore_case = TRUE)
  # is its replacement
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # split only once, on the first phrase that matched; the recursion below
    # takes care of any further occurrences
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    # recursive: the function calls itself on the text before and after the phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)  # no phrase present: fall back to tm's word tokenizer
  }

  # drop extraneous empty strings, which can appear when a phrase sits
  # right next to punctuation
  out[out != ""]
}
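As a quick sanity check (the sentence below is made up for illustration), calling the tokenizer on a string that contains some of the phrases should return each phrase as a single token alongside the ordinary word tokens:
# "outside counsel" and "law firm" come from tokenizing.phrases above,
# so each should come back as one token rather than two
phraseTokenizer("We retained outside counsel from a law firm")
# expected: "We" "retained" "outside counsel" "from" "a" "law firm"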
Then create your term document matrix with the phrases included in it.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
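Since each key phrase now enters the matrix as a single term, per-document counts can be read straight off the corresponding row. A minimal sketch (the term and document names depend on your corpus; note that TermDocumentMatrix lowercases text by default, so keep the phrases lowercase):
# per-document counts for one phrase; the row name must match the term exactly
inspect(tdm["outside counsel", ])

# or pull the counts out as a plain matrix
as.matrix(tdm["outside counsel", ])

# tm_term_score should also find the phrase now, since it is a single term
tm_term_score(tdm, "outside counsel")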
Source: https://stackoverflow.com/questions/31426501/finding-key-phrases-using-tm-package-in-r