I have a corpus of journal data: 15 observations of 3 variables (ID, title, abstract). Using RStudio, I read in the data from a .csv file (one line per observation
There seems to be something odd about the stemCompletion function: it's not obvious how to use stemCompletion in tm version 0.6. There is a nice workaround here that I've used for this answer.
First, make the CSV file that you have:
dat <- read.csv2( text =
"ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")
write.csv2(dat, "Test.csv", row.names = FALSE)
Read it in, transform to a corpus, and stem the words:
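library(tm)         # Corpus, DataframeSource, tm_map, stemCompletion
library(SnowballC)  # stemmer backend used by stemDocument
# Written for tm 0.6; later tm releases expect DataframeSource input
# to have columns named doc_id and text
data <- read.csv2("Test.csv")
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])
corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)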
Have a look to see that it's worked:
inspect(corpus)
[[1]]
1
Below is the first titl
Innovat and Knowledg Manag
[[2]]
2
And now the second Titl
Organiz Perform and Learn are veri import
[[3]]
3
The third titl
Knowledg play an import rule in organ
Here's the nice workaround to get stemCompletion working:
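stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split into stems, complete each against the unstemmed copy, then rejoin
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}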
Inspect the output to see if the stems were completed ok:
lapply(corpus, stemCompletion_mod)
[[1]]
1 Below is the first title Innovation and Knowledge Management
[[2]]
2 And now the second Title Organizational Performance and Learning are NA important
[[3]]
3 The third title Knowledge plays an important rule in organizations
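Success, apart from the NA in the second document, where the stem "veri" (from "very") found no completion in the dictionary.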
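As far as I can tell, stemCompletion completes a stem by looking for dictionary words that start with it, and no word in the original text starts with "veri", so that stem has nothing to match. If you'd rather keep the bare stem than get an NA, here is a minimal base-R sketch of the idea. It does not use tm at all; complete_or_keep is a hypothetical helper name, and the dictionary and stems are typed in by hand from the second document rather than taken from the corpus:

```r
# Hypothetical helper, not part of tm: complete each stem to the shortest
# dictionary word that starts with it, and keep the stem if none does.
complete_or_keep <- function(stems, dictionary) {
  vapply(stems, function(s) {
    # startsWith() does a literal prefix match (base R >= 3.3.0)
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) return(s)   # no match: fall back to the stem
    hits[which.min(nchar(hits))]       # otherwise pick the shortest match
  }, character(1), USE.NAMES = FALSE)
}

# Words and stems of the second document, entered by hand for illustration
dictionary <- c("Organizational", "Performance", "and", "Learning",
                "are", "very", "important")
stems <- c("Organiz", "Perform", "and", "Learn", "are", "veri", "import")

completed <- complete_or_keep(stems, dictionary)
print(completed)
```

In the tm workaround itself, the same effect could probably be had by replacing the NA entries of the stemCompletion result with the corresponding stems before pasting them back together.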