Removing duplicate words in a string in R

Just to help someone who's just voluntarily removed their question, following a request for the code they tried and other comments. Let's assume they tried something like this:

4 Answers

有刺的猬

2020-12-11 04:05

No additional package is needed.

```
str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")
```

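Function for a single string:

```
rem_dup.one <- function(x) {
  # Split on spaces and punctuation, but not on apostrophes (so "here's" stays whole)
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  # Lower-case so "One" and "one" compare equal, then keep first occurrences only
  paste(unique(tolower(trimws(words))), collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")
```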

Vectorized version:

```
rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)
```

Result:

```
"how do i best try and find a way to improve this code" "and here's a second one not third"
```

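One caveat with the split pattern above: when a punctuation mark is directly followed by a space (as in "Drum, 6"), strsplit returns an empty token between the two separators, which ends up as a double space in the result. The following is a minimal sketch of one way to handle that, assuming it is acceptable to drop empty tokens; the name rem_dup_nonempty is hypothetical and not part of the answer above.

```r
# Hypothetical variant of rem_dup.one that discards empty tokens.
rem_dup_nonempty <- function(x) {
  # Same split pattern as above: spaces and punctuation, except apostrophes
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  words <- tolower(trimws(words))
  # ", " matches twice in a row and leaves "" between the matches; drop those
  words <- words[nzchar(words)]
  paste(unique(words), collapse = " ")
}

rem_dup_nonempty("Diamond Drum, 6 kg drum")
# [1] "diamond drum 6 kg"
```

Like the original function, this handles one string at a time; it can be wrapped with Vectorize in the same way.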
情歌与酒

2020-12-11 04:09
If you are still interested in alternative solutions, you can use unique, which slightly simplifies your code.
```
paste(unique(d), collapse = ' ')
```
As per the comment by Thomas, you probably do want to remove punctuation. R's regex functions, such as gsub, support POSIX character classes like [[:punct:]], which you can use instead of listing every character yourself. Of course, you can always name specific characters if you need a more refined pattern.
```
d <- gsub("[[:punct:]]", "", d)
```
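The two snippets above do not define d. As a minimal end-to-end sketch, assuming d is meant to be the character vector of words obtained by splitting the sentence on spaces (the variable names s and result are illustrative, not from the answer):

```r
s <- "How do I best try and try and try and find a way to to improve this code?"

# Strip punctuation first so "code?" and "code" would compare equal
s <- gsub("[[:punct:]]", "", s)

# Assumed definition of d: one element per space-separated word
d <- unlist(strsplit(s, split = " ", fixed = TRUE))

# Keep the first occurrence of each word and glue the sentence back together
result <- paste(unique(d), collapse = " ")
result
# [1] "How do I best try and find a way to improve this code"
```

Note that unique is case-sensitive, so "And" and "and" would both be kept unless the text is lower-cased first.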

有刺的猬

2020-12-11 04:19

I'm not sure whether string case is a concern. This solution uses qdap with the add-on qdapRegex package so that punctuation and the capitalised first letter don't interfere with the removal but are preserved in the result:

```
str <- c("How do I best try and try and try and find a way to to improve this code?",
    "And and here's a second one one and not a third One.")

library(qdap)
library(dplyr) # provides the pipe operator %>%

str %>% 
    tolower() %>%
    word_split() %>% 
    sapply(., function(x) unbag(unique(x))) %>% 
    rm_white_endmark() %>%  
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()

## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
```
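For readers without qdap installed, here is a rough base-R sketch of the same idea: set the end mark aside, lower-case, drop repeated words, then restore the end mark and the capital first letter. It is an approximation under simple assumptions (words separated by spaces, at most one run of ".", "?" or "!" at the end), not a reimplementation of the qdap functions; the names dedupe_keep_endmark and sentences are hypothetical.

```r
# Hypothetical helper: base-R approximation of the qdap pipeline above.
dedupe_keep_endmark <- function(x) {
  vapply(x, function(s) {
    # Capture a trailing end mark such as "." or "?" (character(0) if absent)
    endmark <- regmatches(s, regexpr("[.?!]+$", s))
    # Lower-case and strip the end mark before comparing words
    body <- sub("[.?!]+$", "", tolower(s))
    words <- unique(strsplit(body, " +")[[1]])
    out <- paste0(paste(words, collapse = " "), paste(endmark, collapse = ""))
    # Restore the capital first letter
    sub("^([a-z])", "\\U\\1", out, perl = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

sentences <- c("How do I best try and try and try and find a way to to improve this code?",
               "And and here's a second one one and not a third One.")
dedupe_keep_endmark(sentences)
# [1] "How do i best try and find a way to improve this code?"
# [2] "And here's a second one not third."
```

Punctuation inside the sentence (a comma attached to a word, for example) stays attached to its word here, so "drum," and "drum" would count as different words.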

暖寄归人

2020-12-11 04:25
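To remove duplicate words while leaving special characters untouched, use this function:
```
rem_dup_word <- function(x) {
  x <- tolower(x)  # compare words case-insensitively
  # Split on single spaces only, so punctuation stays attached to its word
  paste(unique(trimws(unlist(strsplit(x, split = " ", perl = TRUE)))), collapse = " ")
}
```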
Input data:
```
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)
```
Output: "samsung wa80e5lec top loading with diamond drum, 6 kg (silver)"

The comma after "drum" is kept because the function splits on spaces only. It treats "Samsung" and "SAMSUNG" as duplicates.
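Because the function lower-cases its input, the result is always lower case. If the original casing should survive, one option is to compare lower-cased words but keep the first occurrence as written. A minimal sketch, assuming words are separated by single spaces; rem_dup_keep_case is a hypothetical name, not part of the answer above:

```r
# Hypothetical variant: case-insensitive comparison, original casing preserved.
rem_dup_keep_case <- function(x) {
  words <- unlist(strsplit(x, split = " ", fixed = TRUE))
  words <- words[nzchar(words)]  # ignore empty tokens from repeated spaces
  # duplicated() flags later occurrences; tolower() makes the check case-insensitive
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

rem_dup_keep_case("Samsung WA80E5LEC SAMSUNG Top Loading")
# [1] "Samsung WA80E5LEC Top Loading"
```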