R tm removeWords function not removing words

孤城傲影 2021-01-05 01:21

I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a dataframe that lists my words in

2 answers
  • 2021-01-05 01:54

    If you get the same error and the solution in the other answer still doesn't work, try paperCorp <- tm_map(paperCorp, content_transformer(tolower)) instead of paperCorp <- tm_map(paperCorp, tolower). tolower() comes from the base package and returns a plain character vector rather than a text document, so afterwards you can no longer use, for example, paperCorp[[j]]$content, only paperCorp[[j]]. This is a bit of a digression, but it may be helpful to someone.
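
    As a rough illustration of that point, here is a minimal sketch on a made-up two-document corpus. The names docs and demoCorp are hypothetical and not from the question. It only shows that wrapping tolower in content_transformer keeps each element a text document whose text can still be read with content().

    ```r
    library(tm)

    # Hypothetical two-document corpus, for illustration only
    docs <- c("Fatty Acids and PPAR", "Dietary Gene STUDY")
    demoCorp <- VCorpus(VectorSource(docs))

    # Wrap the base function so each element stays a text document
    demoCorp <- tm_map(demoCorp, content_transformer(tolower))

    # The document class is preserved after the transformation
    inherits(demoCorp[[1]], "TextDocument")

    # Lowercased text of the first document
    content(demoCorp[[1]])
    ```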

  • 2021-01-05 02:00

    I switched some code around and added tolower. The stopwords are all in lowercase, so you need to lowercase the text before you remove stopwords.

    paperCorp <- tm_map(paperCorp, removePunctuation)
    paperCorp <- tm_map(paperCorp, removeNumbers)
    # added tolower, wrapped in content_transformer so the documents keep their class
    paperCorp <- tm_map(paperCorp, content_transformer(tolower))
    paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
    # moved stripWhitespace
    paperCorp <- tm_map(paperCorp, stripWhitespace)
    
    paperCorp <- tm_map(paperCorp, stemDocument)
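
    To see why the order matters, here is a small sketch on a plain character string rather than a corpus. The sample text and the names txt, kept and cleaned are made up for illustration. It assumes removeWords matches case-sensitively against the lowercase stopword list, so a capitalised "The" is only removed once the text has been lowercased.

    ```r
    library(tm)

    # Hypothetical sample text, not taken from the question's corpus
    txt <- "The study shows THE effect"

    # Without lowercasing, the capitalised stopword survives
    kept <- removeWords(txt, stopwords("english"))
    grepl("The", kept, fixed = TRUE)

    # Lowercase first, then remove stopwords: no standalone "the" remains
    cleaned <- removeWords(tolower(txt), stopwords("english"))
    grepl("\\bthe\\b", cleaned)
    ```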
    

    The upper-case words are no longer needed, since everything has been set to lower case. You can remove them from the list.

    paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                                   "download", "google", "figure",
                                                   "fig", "groups","Google", "however",
                                                   "high", "human", "levels",
                                                   "larger", "may", "number",
                                                   "shown", "study", "studies", "this",
                                                   "using", "two", "the", "Scholar",
                                                   "pubmedncbi", "PubMedNCBI",
                                                   "view", "View", "the", "biol",
                                                   "via", "image", "doi", "one", 
                                                   "analysis"))
    
    paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
    
    dtm <- DocumentTermMatrix(paperCorpPTD)
    
    termFreq <- colSums(as.matrix(dtm))
    head(termFreq)
    
    tf <- data.frame(term = names(termFreq), freq = termFreq)
    tf <- tf[order(-tf[,2]),]
    head(tf)
    
               term  freq
    fatty     fatty 29568
    pparα     ppara 23232
    acids     acids 22848
    gene       gene 15360
    dietary dietary 12864
    scholar scholar 11904
    
    # the comma selects rows; without it the logical vector indexes columns
    tf[tf$term == "study", ]

    [1] term freq
    <0 rows> (or 0-length row.names)
    

    As you can see, "study" is no longer in the corpus. The rest of the listed words are gone as well.
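
    If you want to run the same kind of check yourself, here is a minimal base R sketch. The data frame tf_demo is a hypothetical stand-in with the same two columns as tf, not the real frequency table. An empty row filter or a FALSE membership test both indicate that the term is absent.

    ```r
    # Hypothetical frequency table in the same shape as tf
    tf_demo <- data.frame(term = c("fatty", "acids", "gene"),
                          freq = c(3, 2, 1),
                          stringsAsFactors = FALSE)

    # Row filter: the comma after the condition selects rows, not columns
    hits <- tf_demo[tf_demo$term == "study", ]
    nrow(hits)

    # Equivalent membership test on the term column
    "study" %in% tf_demo$term
    ```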
