问题
I want to stem the documents in a Corpus of plain text documents using the tm package in R. When I apply the SnowballStemmer function to all documents of the corpus, only the last word of each document is stemmed.
library(tm)
library(Snowball)
library(RWeka)
library(rJava)
path <- c("C:/path/to/diretory")
corp <- Corpus(DirSource(path),
readerControl = list(reader = readPlain, language = "en_US",
load = TRUE))
tm_map(corp,SnowballStemmer) #stemDocument has the same problem
I think it is related to the way the documents are read into the corpus. To illustrate this with some simple examples:
> vec<-c("running runner runs","happyness happies")
> stemDocument(vec)
[1] "running runner run" "happyness happi"
> vec2<-c("running","runner","runs","happyness","happies")
> stemDocument(vec2)
[1] "run" "runner" "run" "happy" "happi" <-
> corp<-Corpus(VectorSource(vec))
> corp<-tm_map(corp, stemDocument)
> inspect(corp)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
run runner run
[[2]]
happy happi
> corp2<-Corpus(DirSource(path),readerControl=list(reader=readPlain,language="en_US" , load=T))
> corp2<-tm_map(corp2, stemDocument)
> inspect(corp2)
A corpus with 2 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1.txt`
running runner runs
$`2.txt`
happyness happies
回答1:
load required libraries
library(tm)
library(Snowball)
create vector
vec<-c("running runner runs","happyness happies")
create corpus from vector
vec<-Corpus(VectorSource(vec))
very important thing is to check class of our corpus and preserve it as we want a standard corpus that R functions understand
class(vec[[1]])
vec[[1]]
<<PlainTextDocument (metadata: 7)>>
running runner runs
this will probably tell you Plain text document
So now we modify our faulty stemDocument function. first we convert our plain text to character and then we split out text, apply stemDocument which works fine now and paste it back together. most importantly we reconvert output to PlainTextDocument given by tm package.
stemDocumentfix <- function(x)
{
PlainTextDocument(paste(stemDocument(unlist(strsplit(as.character(x), " "))),collapse=' '))
}
now we can use standard tm_map on our corpus
vec1 = tm_map(vec, stemDocumentfix)
result is
vec1[[1]]
<<PlainTextDocument (metadata: 7)>>
run runner run
most important thing you need remember is to presever class of documents in corpus always. i hope this is a simplified solution to your problem using function from within the 2 libraries loaded.
回答2:
The problem I see is that wordStem takes in a vector of words but Corpus plainTextReader assumes that in the documents that it reads, each word is on its own line. In other words, this would confuse plainTextReader as you will end up with 3 "words" in your document
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean.
From forth the fatal loins of these two foes
Instead the document should be
From
ancient
grudge
break
to
new
mutiny
where
civil
...etc...
Note also that punctuation also confuses wordStem so you would have to take them out as well.
Another way to do this without modifying your actual documents is defining a function that would do the separation and remove non-alphanumerics that appear before or after a word. Here is a simple one:
wordStem2 <- function(x) {
mywords <- unlist(strsplit(x, " "))
mycleanwords <- gsub("^\\W+|\\W+$", "", mywords, perl=T)
mycleanwords <- mycleanwords[mycleanwords != ""]
wordStem(mycleanwords)
}
corpA <- tm_map(mycorpus, wordStem2);
corpB <- Corpus(VectorSource(corpA));
Now just use corpB as your usual Corpus.
来源:https://stackoverflow.com/questions/7263478/snowball-stemmer-only-stems-last-word