Adding custom stopwords in R tm

梦谈多话 2020-12-31 07:15

I have a corpus in R built with the tm package, and I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stop         


        
5 Answers
  • 2020-12-31 07:27

    You could also use the textProcessor function from the stm package, which accepts a customstopwords argument. It works quite well:

    textProcessor(documents, 
      removestopwords = TRUE, customstopwords = NULL)
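
    A minimal sketch of passing custom words through customstopwords, assuming the stm package is installed. The sample documents and the word list are hypothetical, and stem = FALSE and verbose = FALSE are set only to keep the illustration simple:

```r
library(stm)

# Hypothetical sample documents and custom stop words, for illustration only
documents <- c("the tm package builds a corpus from raw text",
               "each corpus holds many short documents")
my_words <- c("corpus", "package")

# removestopwords = TRUE drops the default English stop words;
# customstopwords drops the extra words supplied here
processed <- textProcessor(documents,
                           removestopwords = TRUE,
                           customstopwords = my_words,
                           stem = FALSE,
                           verbose = FALSE)

# Vocabulary that remains after preprocessing
processed$vocab
```

    The custom words should no longer appear in processed$vocab.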
    
  • 2020-12-31 07:31

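    Save your custom stop words in a CSV file, one word per row (e.g. word.csv).

    library(tm)
    # Read the custom words; with header = FALSE the single column is named V1
    custom_words <- read.csv("word.csv", header = FALSE)
    custom_words <- as.character(custom_words$V1)
    # Append tm's built-in list (stopwords() defaults to English)
    all_stopwords <- c(custom_words, stopwords())
    

    Then you can remove the combined list from your text.

    # 'text' starts out as a character vector with one element per document
    text <- VectorSource(text)
    text <- VCorpus(text)
    # Lowercase first: the built-in stop words are all lowercase
    text <- tm_map(text, content_transformer(tolower))
    text <- tm_map(text, removeWords, all_stopwords)
    # Collapse the gaps left behind by the removed words
    text <- tm_map(text, stripWhitespace)
    
    # Inspect the first cleaned document
    text[[1]]$content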
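
    For reference, a small sketch of what such a file can look like and how it reads back. The words and the temporary file location are hypothetical; only base R is used:

```r
# Hypothetical custom stop words, for illustration only
custom <- c("my", "custom", "words")

# Write one word per row, with no header, to a temporary word.csv
csv_path <- file.path(tempdir(), "word.csv")
write.table(custom, csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Read it back; the unnamed single column is called V1
read_back <- as.character(read.csv(csv_path, header = FALSE)$V1)
read_back
```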
    
  • 2020-12-31 07:31

    It is possible to add your own stopwords to the default list that ships with tm. The package installs a number of data files, including one stopword file for each supported language, in its stopwords directory. You can add, delete, or change entries in the english.dat file in that directory.
    The easiest way to find the directory is to search your system for a folder named "stopwords" with your file browser; english.dat sits there alongside the files for the other languages. Open english.dat in RStudio, which lets you edit it, and add your own words or drop existing ones as needed. The process is the same for the stopword file of any other language.
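
    As an alternative to searching with a file browser, the directory can also be located from within R. This is a sketch that assumes tm is installed and keeps its lists in a "stopwords" folder inside the package directory; the variable names are hypothetical:

```r
library(tm)

# Directory inside the installed tm package that holds the stop word lists
stopwords_dir <- system.file("stopwords", package = "tm")

# Language files found there
stopword_files <- list.files(stopwords_dir)
stopword_files

# Full path to the English list
english_path <- file.path(stopwords_dir, "english.dat")
english_path
```

    Keep in mind that reinstalling or updating tm replaces these files, so edits made there do not persist across updates.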

  • 2020-12-31 07:38

    stopwords() simply returns a character vector of words, so you can combine your own words with it:

    tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words")) 
    
  • 2020-12-31 07:38

    You can create a vector of your custom stopwords and pass it along with the defaults, like this:

    tm_map(abs, removeWords, c(stopwords("english"), myStopWords)) 
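
    A self-contained sketch of that pattern on a tiny corpus. The documents, the contents of myStopWords, and the name abstracts (standing in for abs) are hypothetical:

```r
library(tm)

# Hypothetical custom stop words, for illustration only
myStopWords <- c("corpus", "package")

# Tiny lowercase sample corpus standing in for 'abs' from the question
docs <- c("the tm package builds a corpus",
          "a corpus holds many documents")
abstracts <- VCorpus(VectorSource(docs))

# Remove the default English stop words together with the custom ones
cleaned <- tm_map(abstracts, removeWords,
                  c(stopwords("english"), myStopWords))
# Collapse the whitespace left behind by the removed words
cleaned <- tm_map(cleaned, stripWhitespace)

# Pull the cleaned text back out as a character vector
result <- sapply(cleaned, as.character)
result
```

    removeWords matches whole words and is case-sensitive, so lowercase the corpus first if the text contains capitalized words.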
    