Within ID, check for matches/differences


I have a large dataset, over 1.5 million rows, from 600k unique subjects, so a number of subjects have multiple rows. I am trying to find the cases where the one of the subj


4 Answers

死守一世寂寞

2021-01-12 00:27

With a dataset this large, I would propose a different solution that sorts by ID, compares adjacent rows, and takes advantage of R's vectorized operations:

test <- test[order(test$ID), ]
n <- nrow(test)
ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
unique(test$ID[c(FALSE,ind)])
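To make the adjacent-row comparison easier to follow, here is a minimal sketch on a toy data frame. The `toy` data and the intermediate names `same_id`, `diff_dob`, and `conflicting_ids` are made up for illustration and are not part of the original question.

```r
# Hypothetical toy data: subject 2 has two different dates of birth.
toy <- data.frame(
  ID  = c(1, 2, 2, 3, 3, 3),
  DOB = c("1990-05-01", "1985-03-12", "1985-03-21",
          "1977-11-30", "1977-11-30", "1977-11-30"),
  stringsAsFactors = FALSE
)

toy <- toy[order(toy$ID), ]   # rows with the same ID become adjacent
n <- nrow(toy)

# Compare each row (2..n) with the row just before it (1..n-1)
same_id  <- toy$ID[-1]  == toy$ID[-n]    # same subject as the previous row?
diff_dob <- toy$DOB[-1] != toy$DOB[-n]   # different DOB from the previous row?
ind <- same_id & diff_dob

# ind has n - 1 elements, so prepend FALSE to line it up with rows 1..n
conflicting_ids <- unique(toy$ID[c(FALSE, ind)])
print(conflicting_ids)
# [1] 2
```

Sorting only by ID is enough here: if an ID has more than one distinct DOB, at least one pair of neighbouring rows within that ID must differ, whatever order the DOB values appear in.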

On the small test data the timing is similar to Joris's idea (resA below), but on large data the vectorized comparison (resB) is much faster:

test2 <- data.frame(
    ID = rep(1:600000,3),
    DOB = "2000-01-01",
    stringsAsFactors=FALSE
)
test2$DOB[sample.int(nrow(test2),5000)] <- "2000-01-02"

system.time(resA<-{
    x <- unique(test2[c("ID","DOB")])
    x$ID[duplicated(x$ID)]
})
#   user  system elapsed 
#   7.44    0.14    7.58 

system.time(resB <- {
    test2 <- test2[order(test2$ID), ]
    n <- nrow(test2)
    ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
    unique(test2$ID[c(FALSE,ind)])
})
#   user  system elapsed 
#   0.76    0.04    0.81 

all.equal(sort(resA),sort(resB))
# [1] TRUE
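One caveat worth checking on real data: if some DOB values are missing, the comparison yields NA for those pairs, and indexing with NA puts an NA into the result. The sketch below shows this on made-up data (`toy`, `raw`, and `clean` are illustrative names) and uses `which()` to drop the NA flags. Whether a missing DOB should count as a mismatch is a decision the original post does not address; this version simply ignores such pairs.

```r
# Hypothetical data: ID 1 has one missing DOB, ID 2 has two different DOBs.
toy <- data.frame(
  ID  = c(1, 1, 2, 2),
  DOB = c("2000-01-01", NA, "2000-01-01", "2000-01-02"),
  stringsAsFactors = FALSE
)

toy <- toy[order(toy$ID), ]
n <- nrow(toy)
ind <- toy$ID[-1] == toy$ID[-n] & toy$DOB[-1] != toy$DOB[-n]

# Logical indexing keeps NA flags as NA entries in the result
raw <- unique(toy$ID[c(FALSE, ind)])

# which() returns only the positions that are TRUE, so NA flags are dropped
clean <- unique(toy$ID[which(c(FALSE, ind))])

print(raw)
print(clean)
```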


小蘑菇

2021-01-12 00:28

One approach using plyr:

library(plyr)
zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
zz[zz$dups > 1, ]

And if base R is your thing, aggregate() does the same job:

zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
zzz[zzz$DOB > 1 ,]


暗喜

2021-01-12 00:30

Using base functions, the fastest solution would be something like:

> x <- unique(test[c("ID","DOB")])
> x$ID[duplicated(x$ID)]
[1] 2
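As a rough illustration of why this works, the sketch below runs the same two steps on a small made-up data frame (`toy` and `ids` are illustrative names, not from the original post). `unique()` keeps one row per distinct (ID, DOB) pair, so any ID that still appears more than once must have more than one DOB.

```r
# Hypothetical toy data: subject 2 has two different dates of birth.
toy <- data.frame(
  ID  = c(1, 2, 2, 3, 3, 3),
  DOB = c("1990-05-01", "1985-03-12", "1985-03-21",
          "1977-11-30", "1977-11-30", "1977-11-30"),
  stringsAsFactors = FALSE
)

# Step 1: one row per distinct (ID, DOB) combination
x <- unique(toy[c("ID", "DOB")])
print(x)

# Step 2: IDs that still occur more than once have conflicting DOBs
ids <- x$ID[duplicated(x$ID)]
print(ids)
# [1] 2
```

Note that an ID with three or more distinct DOB values would be listed more than once by `duplicated()`; wrapping the result in `unique()` avoids that if each ID should be reported only once.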

Timing:

n <- 1000
system.time(replicate(n,{
  x <- unique(test[c("ID","DOB")])
  x$ID[duplicated(x$ID)]
 }))
   user  system elapsed 
   0.70    0.00    0.71 

system.time(replicate(n,{
  DOBError(test)
}))
   user  system elapsed 
   1.69    0.00    1.69 

system.time(replicate(n,{
  zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
  zzz[zzz$DOB > 1 ,]
}))
   user  system elapsed 
   4.23    0.02    4.27 

system.time(replicate(n,{
   zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
   zz[zz$dups > 1 ,]
}))
   user  system elapsed 
   6.63    0.01    6.64


长发绾君心

2021-01-12 00:33

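DOBError <- function(data){

     # Count the distinct DOB values within each ID
     count <- unlist(lapply(split(data, data$ID), 
        function(x)length(unique(x$DOB))))

     # Return the IDs (as character names) that have more than one DOB
     return(names(count)[count > 1])

}


DOBError(test)

[1] "2"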
