R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 are identical, duplicated flags only rows 4 and 5; the first occurrence, row 3, is not flagged. What I want is to flag every row that has a duplicate, including the first occurrence.
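For instance, with a small character vector (the same one used in the answers below):
> vec <- c("a", "b", "c", "c", "c")
> duplicated(vec)
[1] FALSE FALSE FALSE  TRUE  TRUE
The result I am after here is FALSE FALSE TRUE TRUE TRUE, i.e. TRUE for every element that has a duplicate anywhere in the vector.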
I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:
df <- df %>%
  group_by(Column1, Column2, Column3) %>%
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()
The code groups the rows by the specified columns. If a group's length is greater than 1, the code marks all of the rows in that group as duplicated. Once that is done, you can use the Duplicated column for filtering, etc.
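For example, assuming the same df and the Duplicated column created above, the flagged rows can then be pulled out with a filter:
df %>% filter(Duplicated == "Yes")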
You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.
> vec <- c("a", "b", "c", "c", "c")
> vec[duplicated(vec)]
[1] "c" "c"
> unique(vec[duplicated(vec)])
[1] "c"
> vec %in% unique(vec[duplicated(vec)])
[1] FALSE FALSE TRUE TRUE TRUE
Duplicated rows in a data frame can be obtained with dplyr by doing:
library(dplyr)
df <- bind_rows(iris, head(iris, 20))  # build some test data
df %>% group_by_all() %>% filter(n() > 1) %>% ungroup()
To exclude certain columns, group_by_at(vars(-var1, -var2)) could be used instead to group the data.
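A minimal sketch of that, reusing the test data above with a hypothetical id column added purely so there is something to exclude:
library(dplyr)
df <- bind_rows(iris, head(iris, 20)) %>% mutate(id = row_number())  # id stands in for a column to ignore
df %>% group_by_at(vars(-id)) %>% filter(n() > 1) %>% ungroup()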
If the row indices, and not just the data, are actually needed, you could add them first, as in:
df %>% add_rownames() %>% group_by_at(vars(-rowname)) %>% filter(n() > 1) %>% pull(rowname)
I've had the same question, and if I'm not mistaken, this is also an answer (here vec is a data frame and col the column being checked for duplicates):
vec[vec$col %in% vec[duplicated(vec$col), ]$col, ]
I don't know which one is faster, though; the dataset I'm currently using isn't big enough to run tests that show meaningful time differences.
If you are interested in which rows are duplicated for certain columns you can use a plyr approach:
library(plyr)
ddply(df, .(col1, col2), function(d) if (nrow(d) > 1) d else NULL)
Adding a count variable with dplyr:
df %>% add_count(col1, col2) %>% filter(n > 1) # data frame
df %>% add_count(col1, col2) %>% pull(n) > 1   # logical vector
For duplicate rows (considering all columns):
df %>% group_by_all() %>% add_tally() %>% ungroup() %>% filter(n > 1)
df %>% group_by_all() %>% add_tally() %>% ungroup() %>% pull(n) > 1
The benefit of these approaches is that you can specify the number of occurrences to use as a cutoff.
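For example, assuming the same col1/col2 columns as above and an arbitrary threshold of three occurrences:
df %>% add_count(col1, col2) %>% filter(n > 3)  # rows whose col1/col2 combination appears more than 3 times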
Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():
allDuplicated <- function(vec){
  front <- duplicated(vec)                  # TRUE for repeats of earlier elements
  back <- duplicated(vec, fromLast = TRUE)  # TRUE for elements that recur later
  all_dup <- front + back > 0               # TRUE if the element has a duplicate anywhere
  return(all_dup)
}
Using the same example:
vec <- c("a", "b", "c","c","c")
allDuplicated(vec)
[1] FALSE FALSE TRUE TRUE TRUE