Within ID, check for matches/differences

自闭症患者 2021-01-12 00:11

I have a large dataset: over 1.5 million rows from 600k unique subjects, so many subjects have multiple rows. I am trying to find the cases where one of the subj

4 Answers
  • 2021-01-12 00:27

    With such a large volume of data I would propose a different solution, one based on comparing adjacent rows and using the power of vectorized operations in R:

    test <- test[order(test$ID), ]
    n <- nrow(test)
    ind <- test$ID[-1] == test$ID[-n] & test$DOB[-1] != test$DOB[-n]
    unique(test$ID[c(FALSE,ind)])
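
To make the adjacent-row comparison easier to follow, here is a minimal sketch on a toy data frame. The data and the intermediate names `sorted`, `same_id`, `diff_dob`, and `flagged` are assumptions made up for illustration, not part of the original question.

```r
# Hypothetical toy data (not from the question): only ID 2 has conflicting DOBs.
test <- data.frame(
    ID  = c(1, 2, 2, 3, 3, 1),
    DOB = c("1990-05-01", "1985-03-12", "1985-03-21",
            "1970-01-01", "1970-01-01", "1990-05-01"),
    stringsAsFactors = FALSE
)

# Sort so that all rows of one subject sit next to each other.
sorted <- test[order(test$ID), ]
n <- nrow(sorted)

# Compare rows 2..n with rows 1..(n-1) in a single vectorized step.
same_id  <- sorted$ID[-1] == sorted$ID[-n]
diff_dob <- sorted$DOB[-1] != sorted$DOB[-n]
ind <- same_id & diff_dob

# Prepend FALSE because the first row has no predecessor to compare against.
flagged <- unique(sorted$ID[c(FALSE, ind)])
print(flagged)
```

Any subject with more than one distinct DOB must have at least one adjacent pair of rows that differ once the rows are grouped by ID, so sorting on ID alone should be enough here.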
    

    On the small test data the timing is similar to Joris's idea, but on large data the difference is clear:

    test2 <- data.frame(
        ID = rep(1:600000,3),
        DOB = "2000-01-01",
        stringsAsFactors=FALSE
    )
    test2$DOB[sample.int(nrow(test2),5000)] <- "2000-01-02"
    
    system.time(resA<-{
        x <- unique(test2[c("ID","DOB")])
        x$ID[duplicated(x$ID)]
    })
    #   user  system elapsed 
    #   7.44    0.14    7.58 
    
    system.time(resB <- {
        test2 <- test2[order(test2$ID), ]
        n <- nrow(test2)
        ind <- test2$ID[-1] == test2$ID[-n] & test2$DOB[-1] != test2$DOB[-n]
        unique(test2$ID[c(FALSE,ind)])
    })
    #   user  system elapsed 
    #   0.76    0.04    0.81 
    
    all.equal(sort(resA),sort(resB))
    # [1] TRUE
    
  • 2021-01-12 00:28

    One approach using plyr:

    library(plyr)
    zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
    zz[zz$dups > 1 ,]
    

    And if base R is your thing, the same can be done with aggregate():

    zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
    zzz[zzz$DOB > 1 ,]
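
As a rough illustration of what the aggregate() call produces, the sketch below runs it on a made-up data frame. The toy data and the name `conflicts` are hypothetical and only there to show the shape of the result.

```r
# Hypothetical toy data (not from the question): only ID 2 has conflicting DOBs.
test <- data.frame(
    ID  = c(1, 2, 2, 3, 3, 1),
    DOB = c("1990-05-01", "1985-03-12", "1985-03-21",
            "1970-01-01", "1970-01-01", "1990-05-01"),
    stringsAsFactors = FALSE
)

# One row per ID; the DOB column now holds the number of distinct DOBs for that ID.
zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
print(zzz)

# Keep only the IDs that have more than one distinct DOB.
conflicts <- zzz$ID[zzz$DOB > 1]
print(conflicts)
```

Note that after the call the `DOB` column is a count rather than a date, which is why the filter compares it with 1.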
    
  • 2021-01-12 00:30

    Using base functions, the fastest solution would be something like:

    > x <- unique(test[c("ID","DOB")])
    > x$ID[duplicated(x$ID)]
    [1] 2
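
A small sketch of why this works, and of one edge case: unique() leaves one row per distinct (ID, DOB) pair, so any ID that still appears more than once has conflicting DOBs. The toy data below and the names `dup` and `ids` are assumptions for illustration only. When a subject has three or more distinct DOBs, its ID is returned more than once, so wrapping the result in unique() may be desirable.

```r
# Hypothetical toy data (not from the question): ID 2 has three distinct DOBs.
test <- data.frame(
    ID  = c(1, 2, 2, 2, 3, 3),
    DOB = c("1990-05-01", "1985-03-12", "1985-03-21",
            "1985-04-12", "1970-01-01", "1970-01-01"),
    stringsAsFactors = FALSE
)

# One row per distinct (ID, DOB) combination.
x <- unique(test[c("ID", "DOB")])

# IDs that occur again after their first appearance; ID 2 shows up twice here.
dup <- x$ID[duplicated(x$ID)]
print(dup)

# De-duplicate to get each conflicting ID exactly once.
ids <- unique(dup)
print(ids)
```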
    

    Timing:

    n <- 1000
    system.time(replicate(n,{
      x <- unique(test[c("ID","DOB")])
      x$ID[duplicated(x$ID)]
     }))
       user  system elapsed 
       0.70    0.00    0.71 
    
    system.time(replicate(n,{
      DOBError(test)
    }))
       user  system elapsed 
       1.69    0.00    1.69 
    
    system.time(replicate(n,{
      zzz <- aggregate(DOB ~ ID, data = test, FUN = function(x) length(unique(x)))
      zzz[zzz$DOB > 1 ,]
    }))
       user  system elapsed 
       4.23    0.02    4.27 
    
    system.time(replicate(n,{
       zz <- ddply(test, "ID", summarise, dups = length(unique(DOB)))
       zz[zz$dups > 1 ,]
    }))
       user  system elapsed 
       6.63    0.01    6.64 
    
  • 2021-01-12 00:33
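    DOBError <- function(data){
    
         # Count the distinct DOB values within each ID
         # (uses the argument `data` rather than the global `test`)
         count <- unlist(lapply(split(data, data$ID), 
            function(x) length(unique(x$DOB))))
    
         # Return the IDs (as character names) that have more than one DOB
         return(names(count)[count > 1])
    
    }
    
    
    DOBError(test)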
    
    [1] "2"
    