Here is an example data.table
dt <- data.table(col1 = c(\'A\', \'A\', \'B\', \'C\', \'C\', \'D\'), col2 = c(NA, \'dog\', \'cat\', \'jeep\', \'porsch\', NA))
You missed the parenthesis (maybe a typo), I suppose it should be length(col1) > 1
; And also used ifelse
on a scalar condition which will not work as you expect it to (only the first element from the vector is picked up); If you want to remove NA values from a group when there are non NAs, you can use if/else
:
dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]
# col1 col2
#1: A dog
#2: B cat
#3: C jeep
#4: C porsch
#5: D NA
group by col1, then if group has more than one row and one of them is NA, remove it.
Use an anti-join:
dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]
col1 col2
1: A dog
2: B cat
3: C jeep
4: C porsch
5: D NA
Benchmark from @thela, but assuming there are no (full) dupes in the original data:
set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)
system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
# user system elapsed
# 0.73 0.06 0.81
system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
# user system elapsed
# 2.86 0.03 2.89
system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
# user system elapsed
# 0.39 0.01 0.41
fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE
I changed a little for speed. With a having= argument, this might become faster and more legible.
An attempt to find all the NA
cases in groups where there is also a non-NA
value, and then remove those rows:
dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
# col1 col2
#1: A dog
#2: B cat
#3: C jeep
#4: C porsch
#5: D NA
Seems quicker, though I'm sure someone is going to turn up with an even quicker version shortly:
set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
# user system elapsed
# 1.49 0.02 1.51
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
# user system elapsed
# 4.49 0.04 4.54