data.table with two string columns of set elements, extract unique rows with each row unsorted

前端 未结 3 1087
礼貌的吻别
礼貌的吻别 2020-11-27 07:45

Suppose I have a data.table like this:

Table:

V1 V2
 A  B
 C  D
 C  A
 B  A
 D  C

I want each row to be regarded as a set, which me

相关标签:
3条回答
  • For just two columns you can use the following trick:

    dt = data.table(a = letters[1:5], b = letters[5:1])
    #   a b
    #1: a e
    #2: b d
    #3: c c
    #4: d b
    #5: e a
    
    dt[dt[, .I[1], by = list(pmin(a, b), pmax(a, b))]$V1]
    #   a b
    #1: a e
    #2: b d
    #3: c c
    
    0 讨论(0)
  • 2020-11-27 08:28

    Borrowing (probably unrealistic) data from a dupe:

    library(data.table)
    size <- 118000000
    key1 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
    key2 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
    val <- runif(size, 0.0, 5.0)
    
    dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)
    

    Here's a fast way if your data looks like this:

    # eddi's answer
    system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
    #    user  system elapsed 
    #  101.79    3.01  107.98 
    
    # optimized for this data
    system.time({
      dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
      res2 <- unique(dt2, by=c("key1", "key2")) 
    })
    #    user  system elapsed 
    #    8.50    1.16    4.93 
    
    fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
    # [1] TRUE
    

    Data like this seems unlikely if it pertains to covariances, since you should have at most one duplicate (ie, A-B with B-A).

    0 讨论(0)
  • 2020-11-27 08:29

    Here is the simple way of removing duplicate rows.

    delRows = NULL # the rows to be removed
    for(i in 1:nrow(tab)){
      j = which(tab$V1 == tab$V2[i] & tab$V2 == tab$V1[i])
      j = j [j > i]
      if (length(j) > 0){
        delRows = c(delRows, j)
      }
    }
    tab = tab[-delRows,]
    

    The result is, Before,

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
    4  B  A
    5  D  C
    

    After,

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
    
    0 讨论(0)
提交回复
热议问题