Match/group duplicate rows (indices)

情深已故 2021-02-05 09:47

How can I efficiently match/group the indices of duplicated rows?

Let's say I have this data set:

set.seed(14)
dat <- data.frame(mtc         


        
2 Answers
  • 2021-02-05 10:39

    We can use dplyr. Following a similar approach to @AnandaMahto's post, we add a row-name column with add_rownames(), group by all of the original columns, keep only the groups that contain more than one row with filter(n() > 1), collapse each group's 'rowname' values into a list with summarise(), and extract that list column. Note that add_rownames() and group_by_() are deprecated in current dplyr releases.

    library(dplyr)
    add_rownames(dat) %>% 
          group_by_(.dots= names(dat)) %>% 
          filter(n()>1) %>%
          summarise(rn= list(rowname))%>%
          .$rn
     #[[1]]
     #[1] "3"  "7"  "8"  "10" "11"
    
     #[[2]]
     #[1] "2"  "13"
    
     #[[3]]
     #[1] "1" "4" "5" "6" "9"
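
For newer dplyr versions, the same pipeline could be sketched roughly as follows. This is only an illustration: the small `dat` below is a made-up stand-in (the data set in the question is cut off), and `tibble::rownames_to_column()` and `group_by(across(all_of(...)))` are used here as replacements for the deprecated calls, not taken from the original answer.

```r
library(dplyr)

# Hypothetical stand-in data; rows 1, 3, 6 are identical, as are rows 2 and 5
dat <- data.frame(a = c(1, 2, 1, 3, 2, 1),
                  b = c("x", "y", "x", "z", "y", "x"))

dup_groups <- dat %>%
  tibble::rownames_to_column("rowname") %>%        # replaces add_rownames()
  group_by(across(all_of(names(dat)))) %>%         # replaces group_by_(.dots = ...)
  filter(n() > 1) %>%                              # keep groups with duplicates
  summarise(rn = list(rowname), .groups = "drop") %>%
  pull(rn)                                         # extract the list column

dup_groups
```

With this stand-in data, `dup_groups` should be a list with two elements: the row names "1", "3", "6" and the row names "2", "5". Row 4 is unique, so it is dropped.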
    
  • 2021-02-05 10:42

    Here's a possibility using "data.table":

    library(data.table)
    as.data.table(dat)[, c("GRP", "N") := .(.GRP, .N), by = names(dat)][
                       N > 1, list(list(.I)), by = GRP]
    ##    GRP             V1
    ## 1:   1      1,4,5,6,9
    ## 2:   2           2,13
    ## 3:   3  3, 7, 8,10,11
    

    The basic idea is to create a column that identifies each group of identical rows (using .GRP) and a column that counts the rows in each group (using .N), then keep only the groups with more than one row (N > 1) and collect the row indices (.I) of each remaining group into a list, one entry per "GRP".
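
To make the intermediate steps visible, the idea can be unrolled roughly as below. This is a sketch under assumptions: the `dat` here is a small made-up data frame (the question's data set is cut off), and the explicit `idx` column is a hypothetical helper added for illustration, so the row numbers are stored before any filtering happens.

```r
library(data.table)

# Hypothetical stand-in data; rows 1, 3, 6 are identical, as are rows 2 and 5
dat <- data.frame(a = c(1, 2, 1, 3, 2, 1),
                  b = c("x", "y", "x", "z", "y", "x"))

dt <- as.data.table(dat)

# Step 1: group id (.GRP) and group size (.N) for every combination of columns
dt[, c("GRP", "N") := .(.GRP, .N), by = names(dat)]

# Step 2: store the original row number in a helper column (an assumption,
# the original answer uses .I directly in the final step)
dt[, idx := .I]

# Step 3: keep groups with more than one row and collect their row numbers
res <- dt[N > 1, .(rows = list(idx)), by = GRP]
res
```

With this stand-in data, `res` should have two rows: GRP 1 holding rows 1, 3, 6 and GRP 2 holding rows 2, 5. Printing `dt` after step 1 shows the "GRP" and "N" columns that the filter relies on.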
