Removing stopwords from a user-defined corpus in R

Question

I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

In this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First I convert to a Corpus object:

library(tm) #provides Corpus, tm_map, removeWords and stopwords
documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But this last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has already been asked here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

Answer 1:

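When I run into tm problems, I often end up editing the original text directly instead of going through a corpus.

Removing words this way is a little awkward, but you can paste together a regex from tm's stopword list and apply it to the lower-cased, punctuation-free character vector (not the Corpus object):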

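library(tm) #only needed here for stopwords()
#join the stopwords into one alternation, each wrapped in word boundaries
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '') #delete every match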

> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
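The output above still contains runs of spaces where words were removed. A minimal base R sketch of tidying that up follows. The short `my_stopwords` vector and the `docs` and `cleaned` names are made up for illustration and stand in for `stopwords('en')` and `documents`, so the example runs without tm or stringr.

```r
# Hypothetical, abbreviated stopword list standing in for tm's stopwords('en')
my_stopwords <- c("she", "had", "for", "the", "this", "was")

# Two of the documents, already lower-cased and free of punctuation
docs <- c("she had toast for breakfast",
          "the coffee this morning was excellent")

# Same construction as above: \bword1\b|\bword2\b|...
stopwords_regex <- paste0("\\b", paste(my_stopwords, collapse = "\\b|\\b"), "\\b")

# Remove the stopwords, then collapse repeated whitespace and trim both ends
cleaned <- gsub(stopwords_regex, "", docs, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
# [1] "toast breakfast"          "coffee morning excellent"
```

If the text is kept in a tm corpus instead, tm's own stripWhitespace transformation handles the repeated spaces.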

Answer 2:

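Try doing all of the preprocessing with tm_map after the corpus is built, rather than on the raw character vector. Note that tolower has to be wrapped in content_transformer. This works in my case: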

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "

Answer 3:

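You can use the quanteda package to remove stopwords, but first make sure your text has been tokenized, then use the following:

library(quanteda)
x <- tokens(documents) #documents is the character vector, not a tm Corpus
x <- tokens_select(x, pattern = stopwords("en"), selection = "remove")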

Source: https://stackoverflow.com/questions/37526550/removing-stopwords-from-a-user-defined-corpus-in-r

Tags

topic-modeling