R data.table remove rows where one column is duplicated if another column is NA

伪装坚强ぢ 2021-01-26 20:30

Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

         


        
3 Answers
  •  一生所求
    2021-01-26 20:59

    Group by col1; if a group has more than one row and one of them has NA in col2, remove that row.

    Use an anti-join:

    dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]
    
       col1   col2
    1:    A    dog
    2:    B    cat
    3:    C   jeep
    4:    C porsch
    5:    D     NA
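
    The one-liner above can be unpacked into two steps, which may make the anti-join easier to follow. This is only a sketch on the example data from the question; the names to_drop and res are made up for illustration.

    ```r
    library(data.table)

    # Example data from the question
    dt <- data.table(
      col1 = c("A", "A", "B", "C", "C", "D"),
      col2 = c(NA, "dog", "cat", "jeep", "porsch", NA)
    )

    # Step 1: for every col1 group with more than one row, build a single row
    # whose col2 is NA. .SD[NA_integer_] returns one all-NA row of the
    # non-grouping columns, so the result lists the (col1, NA) pairs to remove.
    to_drop <- dt[, if (.N > 1L) .SD[NA_integer_], by = col1]

    # Step 2: anti-join. Keep only the rows of dt that have no match in to_drop
    # on all columns. Groups without an NA row (here "C") simply match nothing.
    res <- dt[!to_drop, on = names(dt)]
    print(res)
    ```

    Single-row groups such as D never enter to_drop, so their NA row is kept.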
    

    Benchmark adapted from @thela, assuming there are no fully duplicated rows in the original data:

    set.seed(1)
    dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
    dt2 = unique(dt2a)
    
    system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
    #    user  system elapsed 
    #    0.73    0.06    0.81
    
    system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
    #    user  system elapsed 
    #    2.86    0.03    2.89 
    
    system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
    #    user  system elapsed 
    #    0.39    0.01    0.41 
    
    fsetequal(res, res_thela) # TRUE
    fsetequal(res, res_psidom) # TRUE
    

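    For the benchmark I changed my approach slightly for speed: the lookup table is built from per-group counts instead of .SD. If data.table gained a having= argument, this might become faster and more legible.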
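
    As a rough illustration of the faster variant used in the benchmark, here it is split into steps on the small example from the question. The names multi, res_fast and res_sd are hypothetical, and no timing claim is made for data this small.

    ```r
    library(data.table)

    # Example data from the question
    dt <- data.table(
      col1 = c("A", "A", "B", "C", "C", "D"),
      col2 = c(NA, "dog", "cat", "jeep", "porsch", NA)
    )

    # Count rows per group, keep groups with more than one row, drop the count.
    multi <- dt[, .N, by = col1][N > 1L, !"N"]

    # Add an NA col2 with the same type as dt$col2 so the join columns agree.
    multi[, col2 := dt$col2[NA_integer_]]

    # Anti-join: drop the rows of dt that match a (col1, NA) pair in multi.
    res_fast <- dt[!multi, on = names(dt)]

    # The .SD-based version from earlier in the answer, for comparison.
    res_sd <- dt[!dt[, if (.N > 1L) .SD[NA_integer_], by = col1], on = names(dt)]

    # Both approaches should select the same set of rows.
    print(fsetequal(res_fast, res_sd))
    ```

    The idea is that counting with .N and filtering avoids evaluating .SD once per group, which is presumably where the time is saved in the benchmark.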
