Question
I would like to remove duplicate rows based on more than one column using dplyr / tidyverse.
Example
library(dplyr)
df <- data.frame(a=c(1,1,1,2,2,2), b=c(1,2,1,2,1,2), stringsAsFactors = F)
I thought this would return rows 3 and 6, but it returns 0 rows.
df %>% filter(duplicated(a, b))
# [1] a b
# <0 rows> (or 0-length row.names)
Conversely, I thought this would return rows 1, 2, 4 and 5, but it returns all rows.
df %>% filter(!duplicated(a, b))
# a b
# 1 1 1
# 2 1 2
# 3 1 1
# 4 2 2
# 5 2 1
# 6 2 2
What am I missing?
Answer 1:
duplicated expects to operate on "a vector or a data frame or an array", not on two separate vectors; it looks for duplicates in its first argument only.
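To see this at the vector level (a quick sketch; in the default duplicated() method the second positional argument is incomparables, so b is being treated as a set of values that are never flagged as duplicates):
duplicated(df$a, df$b)
# [1] FALSE FALSE FALSE FALSE FALSE FALSE
duplicated(df)
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
That is why filter(duplicated(a, b)) returns no rows, while operating on the whole data frame flags rows 3 and 6. Within the pipe, pass the whole data frame instead: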
df %>%
filter(duplicated(.))
# a b
# 1 1 1
# 2 2 2
df %>%
filter(!duplicated(.))
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
If you prefer to reference a specific subset of columns, then use cbind:
df %>%
filter(duplicated(cbind(a, b)))
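On a recent dplyr (an assumption: this sketch needs dplyr >= 1.1.0, where pick() is available), the same column subset can be selected inside the pipe without cbind:
df %>%
filter(duplicated(pick(a, b)))
# a b
# 1 1 1
# 2 2 2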
As a side note, the dplyr verb for this is distinct:
df %>%
distinct(a, b, .keep_all = TRUE)
# a b
# 1 1 1
# 2 1 2
# 3 2 2
# 4 2 1
though I don't know of a built-in inverse of this function.
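One possible inverse (keep only the a/b combinations that occur more than once) can be sketched with add_count(); the helper column n is the name add_count() creates by default:
df %>%
add_count(a, b) %>%
filter(n > 1) %>%
select(-n) %>%
distinct()
# a b
# 1 1 1
# 2 2 2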
Answer 2:
Using unique on the data frame gives the unique rows.
unique(df)
# a b
#1 1 1
#2 1 2
#4 2 2
#5 2 1
Alternatively, a tidyverse approach could be to select the first row of each group.
library(dplyr)
df %>% group_by(a, b) %>% slice(1L)
#If only 2 columns
#df %>% group_by_all() %>%slice(1L)
# a b
# <dbl> <dbl>
#1 1 1
#2 1 2
#3 2 1
#4 2 2
The inverse would be:
df %>% group_by(a, b) %>% filter(n() > 1) %>% distinct()
#If only 2 columns
#df %>% group_by_all() %>% filter(n() > 1) %>% distinct()
# a b
# <dbl> <dbl>
#1 1 1
#2 2 2
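Note that both of these pipelines return grouped tibbles; to drop the grouping from the result, append ungroup(), for example:
df %>% group_by(a, b) %>% slice(1L) %>% ungroup()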
Source: https://stackoverflow.com/questions/60861391/remove-duplicate-rows-based-on-multiple-columns-using-dplyr-tidyverse