R data.table duplicate rows with a pair of columns

太阳男子 2020-12-22 10:55

data.table is very useful, but I could not find an elegant way to solve the following problem. There are some close answers out there, but none solved my problem. Let's say t

1 answer
  • 2020-12-22 11:59

    The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:

    dt[!duplicated(data.table(pmin(Gene1,Gene2),pmax(Gene1,Gene2)))]
    
    #   Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
    #1: FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
    #2:  EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
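
To see why this works, here is a minimal, self-contained sketch on made-up data (the toy table and the helper name `pair_key` are assumptions for illustration, not the asker's data). `pmin()` and `pmax()` compare the two columns element-wise, so each pair is put into one canonical order before `duplicated()` is applied:

```r
library(data.table)

# Hypothetical toy data: rows 1 and 2 hold the same gene pair in swapped order.
dt <- data.table(
  Gene1 = c("FOXA1", "MYC", "EGFR"),
  Gene2 = c("MYC", "FOXA1", "CD4"),
  CORR  = c(0.99, 0.99, 0.98)
)

# Canonical key: alphabetically smaller gene first, larger gene second.
pair_key <- dt[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep only the first row for each canonical pair.
res <- dt[!duplicated(pair_key)]
print(res)
```

This assumes the key columns are character vectors. If they are factors, converting them with `as.character()` first is probably needed, since ordering comparisons are not meaningful for unordered factors.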
    

    If you need to de-duplicate by more than two key columns, you are probably best off converting to a long format, sorting the values within each row, casting back to a wide format, and then de-duplicating. Like so:

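    dupvars <- c("Gene1","Gene2")
    # Melt the key columns to long form, rank the values within each row (id),
    # cast back to wide so the keys are sorted per row, then drop the id column.
    sel <- !duplicated(
      dcast(
          melt(dt[, c(.SD,id=.(.I)), .SDcols=dupvars], id.vars="id")[
              order(id,value), grp := seq_len(.N), by=id],
          id ~ grp, value.var="value"
      )[,-1])
    dt[sel,]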
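
The nested call above can be unrolled into separate steps to make each stage easier to inspect. The following is only a sketch on made-up data with three key columns (the toy table and the intermediate names `long` and `wide` are assumptions for illustration):

```r
library(data.table)

# Hypothetical toy data: rows 1 and 2 contain the same three genes
# in a different column order.
dt <- data.table(
  Gene1 = c("A", "C", "A"),
  Gene2 = c("B", "A", "D"),
  Gene3 = c("C", "B", "E"),
  CORR  = c(0.9, 0.9, 0.8)
)
dupvars <- c("Gene1", "Gene2", "Gene3")

# 1. Long format: one row per (original row id, key value).
long <- melt(dt[, c(.SD, list(id = .I)), .SDcols = dupvars], id.vars = "id")

# 2. Sort the values within each original row and number them.
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# 3. Back to wide: column k now holds the k-th smallest key of each row.
wide <- dcast(long, id ~ grp, value.var = "value")

# 4. Drop the id column; rows whose sorted keys repeat an earlier row
#    are flagged as duplicates.
wide[, id := NULL]
sel <- !duplicated(wide)
res <- dt[sel]
print(res)
```

Because `dcast()` returns one row per `id` in ascending order, `sel` lines up with the rows of `dt`, which is what lets it be used directly as a row filter.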
    