Just to help someone who\'s just voluntarily removed their question, following a request for code he tried and other comments. Let\'s assume they tried something like this:
There are no need additional package
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
Atomic function:
rem_dup.one <- function(x){
paste(unique(tolower(trimws(unlist(strsplit(x,split="(?!')[ [:punct:]]",fixed=F,perl=T))))),collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")
Vectorize
rem_dup.vector <- Vectorize(rem_dup.one,USE.NAMES = F)
rem_dup.vector(str)
REsult
"how do i best try and find a way to improve this code" "and here's a second one not third"
If you are still interested in alternate solutions you can use unique
which slightly simplifies your code.
paste(unique(d), collapse = ' ')
As per the comment by Thomas, you probably do want to remove punctuation. R's gsub
has some nice internal patterns you can use instead of strict regex. Of course you can always specify specific instances if you want to do some more refined regex.
d <- gsub("[[:punct:]]", "", d)
I'm not sure if string case is a concern. This solution uses qdap
with the add-on qdapRegex
package to make sure that punctuation and beginning string case doesn't interfere with the removal but is maintained:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
library(qdap)
library(dplyr) # so that pipe function (%>% can work)
str %>%
tolower() %>%
word_split() %>%
sapply(., function(x) unbag(unique(x))) %>%
rm_white_endmark() %>%
rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
unname()
## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
To remove duplicate words except for any special characters. use this function
rem_dup_word <- function(x){
x <- tolower(x)
paste(unique(trimws(unlist(strsplit(x,split=" ",fixed=F,perl=T)))),collapse =
" ")
}
Input data:
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg
(Silver)"
rem_dup_word(duptest)
output:samsung wa80e5lec top loading with diamond drum 6 kg (silver).