Multiple results of one variable when applying tm method “stemCompletion”

长情又很酷 2021-01-14 07:06

I have a corpus of journal data: 15 observations of 3 variables (ID, title, abstract). Using RStudio, I read in the data from a .csv file (one line per observation

1 answer

夕颜 (original poster)

2021-01-14 08:00

There seems to be something odd about the stemCompletion function: it's not obvious how to use it in tm version 0.6. There is a nice workaround here that I've used for this answer.

First, make the CSV file that you have:

dat <- read.csv2( text = 
                  "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)

Read it in, transform to a corpus, and stem the words:

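library(tm)        # Corpus, DataframeSource, tm_map, stemCompletion
library(SnowballC) # stemmer used by stemDocument

data <- read.csv2("Test.csv")
# Make sure the text columns are character, not factor
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)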
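As a side note, the as.character() conversions above are only needed when read.csv2 turns text columns into factors, which was the default before R 4.0.0. A minimal base-R sketch, assuming you are free to pass stringsAsFactors = FALSE yourself (the shortened two-row sample is just for illustration):

```r
# Read semicolon-separated text and keep strings as character,
# so no as.character() conversion is needed afterwards.
dat <- read.csv2(
  text = "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important",
  stringsAsFactors = FALSE
)

# Column names are made syntactic on import: "Text A" becomes "Text.A"
str(dat)
```

On R 4.0.0 or later the argument is redundant but harmless, so the same script behaves the same way on both old and new installations.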

Have a look to see that it's worked:

inspect(corpus)

<>

[[1]]
<>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<>
3
The third titl
Knowledg play an import rule in organ

Here's the nice workaround to get stemCompletion working:

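stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split the document into stems, complete each against the dictionary, then rejoin
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}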

Inspect the output to see if the stems were completed ok:

lapply(corpus, stemCompletion_mod)

[[1]]
<>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<>
3 The third title Knowledge plays an important rule in organizations

Success! (Apart from the NA in the second document, where the stem "veri" found no completion.)
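To see why a stem can come back as NA, here is a rough base-R sketch of shortest-match prefix completion. It is not tm's code: complete_shortest and its keep_unmatched argument are hypothetical names made up for this illustration, and the dictionary is just the words of the second abstract. The stem "veri" has no completion because "very" does not start with "veri".

```r
# Hypothetical helper (not part of tm): complete each stem with the shortest
# dictionary word that starts with it. Unmatched stems are either kept as-is
# or returned as NA, depending on keep_unmatched.
complete_shortest <- function(stems, dictionary, keep_unmatched = TRUE) {
  vapply(stems, function(s) {
    # Fixed-string prefix test, so no regex characters need escaping
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) {
      return(if (keep_unmatched) s else NA_character_)
    }
    hits[which.min(nchar(hits))]
  }, character(1), USE.NAMES = FALSE)
}

dictionary <- c("Organizational", "Performance", "and", "Learning",
                "are", "very", "important")
stems <- c("Organiz", "Perform", "and", "Learn", "are", "veri", "import")

without_fallback <- complete_shortest(stems, dictionary, keep_unmatched = FALSE)
with_fallback <- complete_shortest(stems, dictionary)

print(without_fallback)  # sixth element is NA: nothing starts with "veri"
print(with_fallback)     # sixth element stays "veri" instead of NA
```

Keeping the unmatched stem is one possible design choice; whether a leftover stem or an NA is preferable depends on what you do with the text afterwards.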
