How to remove unique entries and keep duplicates in R

一生所求 2020-12-09 11:15
ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0013   367 5.125   40152   1
A0014   337 16.3125 38624   0
A0020   367          

  • 2020-12-09 12:09

    General comments.

    • The ave approach is the only one here that preserves the data's initial row ordering.
    • The by approach should be very slow. I suspect that data.table and dplyr are not much faster than ave and tapply (yet) at selecting groups. Benchmarks to prove me wrong welcome!

    base R (Thanks to @thelatemail for both of the first two approaches.)

    1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.

    df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]
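To see why the filter works, it can help to look at the intermediate vector that ave produces. The sketch below rebuilds the sample data (rows 1-4 from the question, rows 5-6 taken from the printed output in the answers); the names group_size and kept are only illustrative.

```r
# Sample data: rows 1-4 come from the question, rows 5-6 from the answers' output
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# ave() returns one value per row: the size of that row's ID group
group_size <- ave(seq_len(nrow(df)), df$ID, FUN = length)
group_size
# 2 2 1 1 2 2

# Keep only the rows whose ID occurs more than once
kept <- df[group_size > 1, ]
kept
```

Because the logical vector lines up with the rows one to one, the surviving rows stay in their original order.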

    2) Alternately, we split row names or numbers by df$ID, selecting which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.

    df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
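A small sketch of the intermediate steps, assuming the same sample IDs as in the question and answers (only the ID column matters here; rows_by_id and keep_rows are illustrative names):

```r
# Only the ID column is needed to illustrate the row selection
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  stringsAsFactors = FALSE
)

# One list element per ID: the row numbers for groups with more than
# one row, NULL for groups with a single row
rows_by_id <- tapply(seq_len(nrow(df)), df$ID,
                     function(x) if (length(x) > 1) x)

# unlist() drops the NULL entries and flattens the rest into one vector
keep_rows <- unname(unlist(rows_by_id))
keep_rows
# 1 2 5 6

df[keep_rows, ]
```

Note that tapply orders the groups by ID, so on data that is not already sorted by ID the result comes back grouped by ID rather than in the original row order.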

    What follows is a worse approach, but it better parallels what you see with data.table and dplyr:

    3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind the pieces back together.

    do.call(rbind, c(list(make.row.names = FALSE),
        by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))

    data.table: .N corresponds to nrow within a by=ID group, and .SD is the subset of data.

    setDT(df)[, if (.N>1) .SD, by=ID]
    #       ID Cat1    Cat2  Cat3 Cat4
    # 1: A0001  358 11.2500 37428    0
    # 2: A0001  279 14.6875 38605    0
    # 3: A0020  367  8.8750 37797    0
    # 4: A0020  339  9.6250 39324    0

    dplyr: n() corresponds to nrow within a group_by(ID) group.

    df %>% group_by(ID) %>% filter( n() > 1 )
    # Source: local data frame [4 x 5]
    # Groups: ID
    #      ID Cat1    Cat2  Cat3 Cat4
    # 1 A0001  358 11.2500 37428    0
    # 2 A0001  279 14.6875 38605    0
    # 3 A0020  367  8.8750 37797    0
    # 4 A0020  339  9.6250 39324    0
  • 2020-12-09 12:11

    Another option in base R, using duplicated:

    dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]
    #      ID Cat1    Cat2  Cat3 Cat4
    # 1 A0001  358 11.2500 37428    0
    # 2 A0001  279 14.6875 38605    0
    # 5 A0020  367  8.8750 37797    0
    # 6 A0020  339  9.6250 39324    0
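The extra %in% step is needed because duplicated only flags the second and later occurrences of a value, not the first. A minimal sketch on the ID column alone (ids and dup_ids are illustrative names):

```r
ids <- c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020")

# duplicated() marks only the repeats, so the first A0001 and the
# first A0020 are FALSE
duplicated(ids)
# FALSE  TRUE FALSE FALSE FALSE  TRUE

# The IDs that occur more than once
dup_ids <- ids[duplicated(ids)]
dup_ids
# "A0001" "A0020"

# Matching every row against those IDs picks up the first occurrences too
keep <- ids %in% dup_ids
keep
# TRUE  TRUE FALSE FALSE  TRUE  TRUE
```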

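    data.table, using duplicated

    Combining duplicated with its fromLast variant gives:

    setkey(setDT(dx), ID) # or with data.table 1.9.5+: setDT(dx, key="ID")
    # by="ID" is given explicitly: since data.table 1.9.8, duplicated()
    # compares all columns by default instead of only the key
    dx[duplicated(dx, by="ID") | duplicated(dx, by="ID", fromLast=TRUE)]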
    #       ID Cat1    Cat2  Cat3 Cat4
    # 1: A0001  358 11.2500 37428    0
    # 2: A0001  279 14.6875 38605    0
    # 3: A0020  367  8.8750 37797    0
    # 4: A0020  339  9.6250 39324    0

    This can also be done in base R, but I prefer data.table here for its syntactic sugar.
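For comparison, a base R form of the same idea might look like the sketch below. It assumes the sample data from the question and answers; is_dup and result are illustrative names.

```r
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Scanning forward flags every occurrence except the first;
# scanning backward (fromLast = TRUE) flags every occurrence except the last.
# Their union flags every row whose ID appears more than once.
is_dup <- duplicated(dx$ID) | duplicated(dx$ID, fromLast = TRUE)

result <- dx[is_dup, ]
result
```

Unlike the keyed data.table version, this does not reorder the data, so the kept rows stay in their original positions.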
