问题
I have a dataset in which I need to conditionally remove duplicated rows based on values in another column.
Specifically, I need to delete any row where size = 0
only if SampleID
is duplicated.
SampleID<-c("a", "a", "b", "b", "b", "c", "d", "d", "e")
size<-c(0, 1, 1, 2, 3, 0, 0, 1, 0)
data<-data.frame(SampleID, size)
I want to delete rows with:
Sample ID size
a 0
d 0
And keep:
SampleID size
a 1
b 1
b 2
b 3
c 0
d 1
e 0
Note. actual dataset it very large, so I am not looking for a way to just remove a known row by row number.
回答1:
Using data.table
framework: Transform your set to data.table
require(data.table)
setDT(data)
Build a list of id where we can delete lines:
dropable_ids = unique(data[size != 0, SampleID])
Finaly keep lines that are not in the dropable list or with non 0 value
data = data[!(SampleID %in% dropable_ids & size == 0), ]
Please note that not( a and b )
is equivalent to a or b
but data.table framework doesn't handle well or
.
Hope it helps
回答2:
A solution that works in base R without data.table and is easy to follow through for R starters:
#Find all duplicates
data$dup1 <- duplicated(data$SampleID)
data$dup2 <- duplicated(data$SampleID, fromLast = TRUE)
data$dup <- ifelse(data$dup1 == TRUE | data$dup2 == TRUE, 1, 0)
#Subset to relevant
data$drop <- ifelse(data$dup == 1 & data$size == 0, 1, 0)
data2 <- subset(data, drop == 0)
回答3:
In dplyr
we can do this using group_by
and filter
:
library(dplyr)
data %>%
group_by(SampleID) %>%
filter(!(size==0 & n() > 1))
#> # A tibble: 7 x 2
#> # Groups: SampleID [5]
#> SampleID size
#> <fct> <dbl>
#> 1 a 1
#> 2 b 1
#> 3 b 2
#> 4 b 3
#> 5 c 0
#> 6 d 1
#> 7 e 0
来源:https://stackoverflow.com/questions/46119308/conditionally-removing-duplicates-in-r