Snowball Stemmer only stems last word

最后都变了- 提交于 2019-12-05 06:05:08

load required libraries

library(tm)
library(Snowball)

create vector

vec<-c("running runner runs","happyness happies")

create corpus from vector

vec<-Corpus(VectorSource(vec))

very important thing is to check class of our corpus and preserve it as we want a standard corpus that R functions understand

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

this will probably tell you Plain text document

So now we modify our faulty stemDocument function. first we convert our plain text to character and then we split out text, apply stemDocument which works fine now and paste it back together. most importantly we reconvert output to PlainTextDocument given by tm package.

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}

now we can use standard tm_map on our corpus

vec1 = tm_map(vec, stemDocumentfix)

result is

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

most important thing you need remember is to presever class of documents in corpus always. i hope this is a simplified solution to your problem using function from within the 2 libraries loaded.

The problem I see is that wordStem takes in a vector of words but Corpus plainTextReader assumes that in the documents that it reads, each word is on its own line. In other words, this would confuse plainTextReader as you will end up with 3 "words" in your document

From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes

Instead the document should be

From
ancient
grudge
break
to
new
mutiny
where 
civil
...etc...

Note also that punctuation also confuses wordStem so you would have to take them out as well.

Another way to do this without modifying your actual documents is defining a function that would do the separation and remove non-alphanumerics that appear before or after a word. Here is a simple one:

wordStem2 <- function(x) {
    mywords <- unlist(strsplit(x, " "))
    mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl=T)
    mycleanwords <- mycleanwords[mycleanwords != ""]
    wordStem(mycleanwords)
}

corpA <- tm_map(mycorpus, wordStem2);
corpB <- Corpus(VectorSource(corpA));

Now just use corpB as your usual Corpus.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!