Removing duplicate words in a string in R

栀梦 2020-12-11 03:47

Just to help someone who's just voluntarily removed their question, following a request for the code they tried and other comments. Let's assume they tried something like this:

4 Answers
  • 2020-12-11 04:05

    No additional packages are needed.

    str <- c("How do I best try and try and try and find a way to to improve this code?",
             "And and here's a second one one and not a third One.")
    

    A function for a single string:

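    rem_dup.one <- function(x) {
      # Split on spaces and punctuation, but never on an apostrophe (keeps "here's" whole)
      words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
      # Lower-case before de-duplicating so "And" and "and" count as the same word
      paste(unique(tolower(trimws(words))), collapse = " ")
    }
    rem_dup.one("And and here's a second one one and not a third One.")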
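    To see what the negative lookahead (?!') in the split pattern buys, here is a small sketch comparing it with a plain character class. The variable names kept and split_all are only illustrative.

    ```r
    x <- "And and here's a second one one and not a third One."

    # With the lookahead, an apostrophe is never used as a split point
    kept <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))

    # Without it, the apostrophe counts as punctuation and "here's" is cut in two
    split_all <- unlist(strsplit(x, split = "[ [:punct:]]"))

    "here's" %in% kept                   # TRUE
    all(c("here", "s") %in% split_all)   # TRUE
    ```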
    

    Vectorized version:

    rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
    rem_dup.vector(str)
    

    Result:

    "how do i best try and find a way to improve this code" "and here's a second one not third" 
    
  • 2020-12-11 04:09

    If you are still interested in alternative solutions, you can use unique, which slightly simplifies your code (d is the character vector of words obtained by splitting the string):

    paste(unique(d), collapse = ' ')
    

    As per the comment by Thomas, you probably do want to remove punctuation. R's regular expressions support POSIX character classes such as [[:punct:]], so with gsub you don't have to list every punctuation character yourself. Of course, you can always spell out specific characters if you want a more refined pattern.

    d <- gsub("[[:punct:]]", "", d)
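    Putting the two steps together might look like the sketch below. The strsplit call that builds d is an assumption, since the original attempt is not shown here; str is the first example sentence used in the other answers.

    ```r
    str <- "How do I best try and try and try and find a way to to improve this code?"

    # Assumed first step: split the sentence into words on single spaces
    d <- unlist(strsplit(str, split = " "))

    # Strip punctuation from every word, then keep the first occurrence of each
    d <- gsub("[[:punct:]]", "", d)
    result <- paste(unique(d), collapse = " ")
    result
    ```

    Note that this comparison is case-sensitive, so "One" and "one" would both be kept.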
    
  • 2020-12-11 04:19

    I'm not sure whether string case is a concern. This solution uses qdap with its add-on package qdapRegex so that punctuation and the capitalization at the start of the string don't interfere with the removal but are still preserved:

    str <- c("How do I best try and try and try and find a way to to improve this code?",
        "And and here's a second one one and not a third One.")
    
    library(qdap)
    library(dplyr) # provides the pipe operator %>%
    
    str %>% 
        tolower() %>%
        word_split() %>% 
        sapply(., function(x) unbag(unique(x))) %>% 
        rm_white_endmark() %>%  
        rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
        unname()
    
    ## [1] "How do i best try and find a way to improve this code?"
    ## [2] "And here's a second one not third."
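    The final rm_default step capitalizes the first letter through the Perl-style \\U replacement escape. As a rough base R illustration of just that step (not of the qdap pipeline as a whole), sub() with perl = TRUE accepts the same pattern and replacement. The input vector lowered is typed by hand here to stand in for the lower-cased, de-duplicated intermediate result.

    ```r
    lowered <- c("how do i best try and find a way to improve this code?",
                 "and here's a second one not third.")

    # \\U upper-cases the captured first letter; this only works with perl = TRUE
    capitalized <- sub("(^[a-z]{1})", "\\U\\1", lowered, perl = TRUE)
    capitalized
    ```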
    
  • 2020-12-11 04:25

    To remove duplicate words while leaving any special characters in place, use this function:

    rem_dup_word <- function(x) {
      x <- tolower(x)  # lower-case first so "Samsung" and "samsung" match
      paste(unique(trimws(unlist(strsplit(x, split = " ", perl = TRUE)))), collapse = " ")
    }
    

    Input data:

    duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
    
    rem_dup_word(duptest)
    

    Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

    It treats "Samsung" and "SAMSUNG" as duplicates because the input is lower-cased before comparison.
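    If the original capitalization should survive, one possible variant compares lower-cased copies with duplicated() but keeps the first occurrence of each word as typed. The name rem_dup_keep_case is hypothetical and not part of the answer above.

    ```r
    rem_dup_keep_case <- function(x) {
      # Split on runs of spaces and drop any empty pieces
      words <- unlist(strsplit(x, split = " +"))
      words <- words[nzchar(words)]
      # duplicated() flags later repeats; compare case-insensitively, keep original text
      paste(words[!duplicated(tolower(words))], collapse = " ")
    }

    kept_case <- rem_dup_keep_case("Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)")
    kept_case
    ```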
