Sample data (referred to as df or dx in the answers below):

ID    Cat1 Cat2    Cat3  Cat4
A0001 358  11.25   37428 0
A0001 279  14.6875 38605 0
A0013 367  5.125   40152 1
A0014 337  16.3125 38624 0
A0020 367  8.875   37797 0
A0020 339  9.625   39324 0
General comments. The ave approach is the only one here that preserves the data's initial row ordering. The by approach should be very slow. I suspect that data.table and dplyr are not much faster than ave and tapply (yet) at selecting groups. Benchmarks to prove me wrong are welcome!

base R (Thanks to @thelatemail for both of the first two approaches.)
1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.
df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]
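To see what the filter keys on, here is the intermediate vector of group sizes (a sketch assuming df holds the six sample rows shown at the top):

# each row is labelled with the size of its ID group
ave(1:nrow(df), df$ID, FUN=length)
# [1] 2 2 1 1 2 2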
2) Alternately, we split row names or numbers by df$ID, selecting which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.
df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
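On the same sample data, the tapply call keeps only the groups of row numbers with more than one element; singleton groups become NULL and are dropped by unlist (a sketch assuming df as above):

unname(unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)))
# [1] 1 2 5 6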
What follows is a worse approach, but it better parallels what you see with data.table and dplyr:
3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind them back together.
do.call( rbind, c(list(make.row.names = FALSE),
by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))
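As a rough illustration of why the rbind step is needed (again assuming df is the sample data), by returns one list element per ID, with NULL for the singleton groups:

grps <- by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD)
length(grps)                            # 4: one element per distinct ID
sum(vapply(grps, is.null, logical(1)))  # 2: singleton groups, silently dropped by rbind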
data.table

.N corresponds to nrow within a by=ID group, and .SD is the subset of data for that group.
library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
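An equivalent data.table idiom, not from the original answer, collects the qualifying row indices with .I and subsets once; it can be faster because it does not materialize .SD for every group:

# assumes df has already been converted with setDT(df) as above
df[df[, .I[.N > 1], by=ID]$V1]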
dplyr

n() corresponds to nrow within a group_by(ID) group.
library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )
# Source: local data frame [4 x 5]
# Groups: ID
#
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 3 A0020 367 8.8750 37797 0
# 4 A0020 339 9.6250 39324 0
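Since the general comments above invite benchmarks, here is a minimal sketch of how the approaches could be timed on a larger data set; the data size, column names, and use of the microbenchmark package are assumptions of mine, not part of the original answers:

library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(1)
big   <- data.frame(ID  = sample(sprintf("A%04d", 1:50000), 200000, replace = TRUE),
                    val = rnorm(200000))
bigDT <- as.data.table(big)

microbenchmark(
  ave    = big[ave(1:nrow(big), big$ID, FUN=length) > 1, ],
  tapply = big[unlist(tapply(1:nrow(big), big$ID, function(x) if (length(x) > 1) x)), ],
  dt     = bigDT[, if (.N > 1) .SD, by=ID],
  dplyr  = big %>% group_by(ID) %>% filter(n() > 1),
  times  = 10
)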
Another option in base R uses duplicated:
dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 5 A0020 367 8.8750 37797 0
# 6 A0020 339 9.6250 39324 0
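To see why this works (assuming dx holds the six sample rows at the top): duplicated flags the second and later occurrences of each ID, and %in% then pulls in every row whose ID appears in that set:

duplicated(dx$ID)
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
dx$ID[duplicated(dx$ID)]   # the IDs that occur more than once: "A0001" and "A0020"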
Using duplicated together with its fromLast version (on a data.table keyed by ID), you get:
library(data.table)
setkey(setDT(dx), ID)  # or with data.table 1.9.5+: setDT(dx, key = "ID")
# be explicit about the ID column: newer data.table versions compare all
# columns by default in duplicated(), not just the key
dx[duplicated(dx, by = "ID") | duplicated(dx, by = "ID", fromLast = TRUE)]
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
This can also be done in base R, but I prefer data.table here for its syntactic sugar.
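For completeness, the base R version of the same duplicated/fromLast trick would look something like this (my translation, not code from the original answer):

dx[duplicated(dx$ID) | duplicated(dx$ID, fromLast = TRUE), ]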