Snowball Stemmer only stems last word

落花浮王杯 提交于 2019-12-07 01:40:59

问题


I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.

library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/diretory")
corp <- Corpus(DirSource(path),
               readerControl = list(reader = readPlain, language = "en_US",
                                    load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem

I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:

> vec<-c("running runner runs","happyness happies")
> stemDocument(vec) 
   [1] "running runner run" "happyness happi" 

> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
   [1] "run"    "runner" "run"    "happy"  "happi" <- 

> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
   Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   [[1]]
   run runner run

   [[2]]
   happy happi

> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" ,  load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
   A corpus with 2 text documents

   The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
     create_date creator 
   Available variables in the data frame are:
     MetaID 

   $`1.txt`
   running runner runs

   $`2.txt`
   happyness happies

回答1:


load required libraries

library(tm)
library(Snowball)

create vector

vec<-c("running runner runs","happyness happies")

create corpus from vector

vec<-Corpus(VectorSource(vec))

very important thing is to check class of our corpus and preserve it as we want a standard corpus that R functions understand

class(vec[[1]])

vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs

this will probably tell you Plain text document

So now we modify our faulty stemDocument function. first we convert our plain text to character and then we split out text, apply stemDocument which works fine now and paste it back together. most importantly we reconvert output to PlainTextDocument given by tm package.

stemDocumentfix <- function(x)
{
    PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}

now we can use standard tm_map on our corpus

vec1 = tm_map(vec, stemDocumentfix)

result is

vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run

most important thing you need remember is to presever class of documents in corpus always. i hope this is a simplified solution to your problem using function from within the 2 libraries loaded.




回答2:


The problem I see is that wordStem takes in a vector of words but Corpus plainTextReader assumes that in the documents that it reads, each word is on its own line. In other words, this would confuse plainTextReader as you will end up with 3 "words" in your document

From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes

Instead the document should be

From
ancient
grudge
break
to
new
mutiny
where 
civil
...etc...

Note also that punctuation also confuses wordStem so you would have to take them out as well.

Another way to do this without modifying your actual documents is defining a function that would do the separation and remove non-alphanumerics that appear before or after a word. Here is a simple one:

wordStem2 <- function(x) {
    mywords <- unlist(strsplit(x, " "))
    mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl=T)
    mycleanwords <- mycleanwords[mycleanwords != ""]
    wordStem(mycleanwords)
}

corpA <- tm_map(mycorpus, wordStem2);
corpB <- Corpus(VectorSource(corpA));

Now just use corpB as your usual Corpus.



来源:https://stackoverflow.com/questions/7263478/snowball-stemmer-only-stems-last-word

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!