Finding ALL duplicate rows, including “elements with smaller subscripts”

借酒劲吻你 2020-11-21 07:55

R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated flags only rows 4 and 5; how can I also flag the first occurrence, i.e. get all duplicated rows?
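
As a minimal illustration of that default behaviour:

    vec <- c("a", "b", "c", "c", "c")
    duplicated(vec)
    # [1] FALSE FALSE FALSE  TRUE  TRUE   (the first "c" is not flagged)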

7 answers
  • 2020-11-21 08:25

    I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:

    library(dplyr)

    df <- df %>%
      group_by(Column1, Column2, Column3) %>%
      mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                    TRUE ~ "No")) %>%
      ungroup()
    

    The code groups the rows by the specified columns. If a group contains more than one row, every row in that group is marked as duplicated. Once that is done you can use the Duplicated column for filtering and so on, as sketched below.
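
    For example, a minimal sketch of that filtering step (Column1 through Column3 above are placeholder column names):

    # keep only the rows that belong to a group with more than one member
    dupes <- df %>% filter(Duplicated == "Yes")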

  • 2020-11-21 08:30

    You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

    > vec <- c("a", "b", "c","c","c")
    > vec[ duplicated(vec)]
    [1] "c" "c"
    > unique(vec[ duplicated(vec)])
    [1] "c"
    >  vec %in% unique(vec[ duplicated(vec)]) 
    [1] FALSE FALSE  TRUE  TRUE  TRUE
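
    The same idea carries over to a data frame column; as a sketch (df and col are placeholder names, not from the question):

    # rows of df whose value in col occurs more than once
    df[df$col %in% unique(df$col[duplicated(df$col)]), ]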
    
  • 2020-11-21 08:36

    Duplicate rows in a data frame can be obtained with dplyr by doing

    df = bind_rows(iris, head(iris, 20)) # build some test data
    df %>% group_by_all() %>% filter(n()>1) %>% ungroup()
    

    To exclude certain columns, group_by_at(vars(-var1, -var2)) can be used instead to group the data.

    If the row indices, and not just the data, are needed, you can add them first, as in:

    df %>% add_rownames %>% group_by_at(vars(-rowname)) %>% filter(n()>1) %>% pull(rowname)
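
    Note that group_by_all(), group_by_at(), and add_rownames() are superseded or deprecated in recent dplyr releases. A sketch of the same idea with current verbs (assuming dplyr 1.0 or later and the same test data):

    # all columns, replacement for group_by_all()
    df %>% group_by(across(everything())) %>% filter(n() > 1) %>% ungroup()

    # row indices, using tibble::rownames_to_column() instead of add_rownames()
    df %>%
      tibble::rownames_to_column() %>%
      group_by(across(-rowname)) %>%
      filter(n() > 1) %>%
      pull(rowname)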
    
  • 2020-11-21 08:37

    I've had the same question, and if I'm not mistaken, this is also an answer.

    vec[vec$col %in% vec$col[duplicated(vec$col)], ]   # vec is a data frame, col the column to check
    

    I don't know which one is faster, though; the dataset I'm currently using isn't big enough to produce meaningful timing differences. A rough way to time them is sketched below.
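
    If one did want to compare them, a rough sketch with the microbenchmark package (df and col are placeholder names, not from the question):

    library(dplyr)
    library(microbenchmark)
    microbenchmark(
      base  = df[df$col %in% df$col[duplicated(df$col)], ],
      dplyr = df %>% group_by(col) %>% filter(n() > 1) %>% ungroup(),
      times = 100
    )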

  • 2020-11-21 08:37

    If you are interested in which rows are duplicated for certain columns you can use a plyr approach:

    library(plyr)
    ddply(df, .(col1, col2), function(df) if(nrow(df) > 1) df else c())
    

    Adding a count variable with dplyr:

    df %>% add_count(col1, col2) %>% filter(n > 1)  # data frame
    df %>% add_count(col1, col2) %>% select(n) > 1  # logical vector
    

    For duplicate rows (considering all columns):

    df %>% group_by_all %>% add_tally %>% ungroup %>% filter(n > 1)
    df %>% group_by_all %>% add_tally %>% ungroup %>% select(n) > 1
    

    The benefit of these approaches is that you can specify the number of duplicate occurrences to use as a cutoff, as in the sketch below.
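
    For instance, a sketch that keeps only rows whose (col1, col2) combination appears at least three times:

    df %>% add_count(col1, col2) %>% filter(n >= 3)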

  • 2020-11-21 08:39

    Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():

    allDuplicated <- function(vec){
      front <- duplicated(vec)                   # TRUE for repeats after the first occurrence
      back <- duplicated(vec, fromLast = TRUE)   # TRUE for repeats before the last occurrence
      all_dup <- front + back > 0                # TRUE wherever the value occurs more than once
      return(all_dup)
    }
    

    Using the same example:

    vec <- c("a", "b", "c","c","c") 
    allDuplicated(vec) 
    [1] FALSE FALSE  TRUE  TRUE  TRUE
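
    The same function can also be used to index the rows of a data frame on one column; as a sketch (df and col are placeholder names):

    df[allDuplicated(df$col), ]   # all rows whose col value occurs more than once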
    
    