R: Checking if a set of variables forms a unique index

夕颜 2021-01-19 20:42

I have a large data frame and I want to check whether the values of a set of (factor) variables uniquely identify each row of the data.

My current strategy is t

3 answers
  • 2021-01-19 21:19

    Perhaps anyDuplicated:

    anyDuplicated( dfTemp[, c("Var1", "Var2", "Var3") ] )
    

    or using dplyr:

    dfTemp %>% select(Var1, Var2, Var3) %>% anyDuplicated()
    

    This is still going to be wasteful though because anyDuplicated will first paste the columns into a character vector.
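
    Note that anyDuplicated returns an integer, not a logical: 0 when no row is repeated, otherwise the index of the first repeated row. A minimal sketch with made-up data (dfToy and the helper is_unique_key are hypothetical names, not from the question):

    ```r
    # Toy data: rows 1 and 3 share the same (Var1, Var2) combination.
    dfToy <- data.frame(
      Var1 = factor(c("a", "b", "a")),
      Var2 = factor(c("x", "x", "x")),
      Var3 = factor(c("p", "p", "q"))
    )

    # Hypothetical helper: TRUE when the given columns identify every row uniquely.
    is_unique_key <- function(df, cols) {
      anyDuplicated(df[, cols, drop = FALSE]) == 0L
    }

    # Index of the first repeated row on (Var1, Var2), or 0 if there is none.
    first_dup <- anyDuplicated(dfToy[, c("Var1", "Var2")])

    key_two   <- is_unique_key(dfToy, c("Var1", "Var2"))
    key_three <- is_unique_key(dfToy, c("Var1", "Var2", "Var3"))
    ```

    Comparing the result with 0L gives the yes/no answer the question asks for.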

  • 2021-01-19 21:21

    How about:

    length(unique(paste(dfTemp$var1, dfTemp$var2, dfTemp$var3)))==nrow(dfTemp)
    

    Paste the variables into one string per row, take the unique values, and compare the length of that vector with the number of rows in your data frame.
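
    One caveat: paste joins values with a single space by default, so two different rows can collapse to the same string when the values themselves contain spaces. A small sketch with made-up data (dfToy is hypothetical), compared against anyDuplicated on the columns directly:

    ```r
    # Two distinct rows whose pasted forms coincide.
    dfToy <- data.frame(
      var1 = c("a b", "a"),
      var2 = c("c", "b c"),
      stringsAsFactors = FALSE
    )

    # With the default separator, both rows become the same string.
    pasted <- paste(dfToy$var1, dfToy$var2)
    unique_by_paste <- length(unique(pasted)) == nrow(dfToy)

    # Checking the columns themselves keeps the rows apart.
    unique_by_rows <- anyDuplicated(dfToy[, c("var1", "var2")]) == 0L
    ```

    Passing a separator that cannot occur in the data, for example sep = "\r", avoids this collision.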

  • 2021-01-19 21:27

    The data.table package provides very fast duplicated and unique methods for data.tables. Both accept a by= argument naming the columns on which the duplicated/unique results should be computed.

    Here's an example of a large data.frame:

    require(data.table)
    set.seed(45L)
    ## use setDT(dat) if your data is a data.frame, 
    ## to convert it to a data.table by reference
    dat <- data.table(var1=sample(100, 1e7, TRUE), 
                     var2=sample(letters, 1e7, TRUE), 
                     var3=sample(as.numeric(sample(c(-100:100, NA), 1e7,TRUE))))
    
    system.time(any(duplicated(dat)))
    #  user  system elapsed
    # 1.632   0.007   1.671
    

    By comparison, the same check takes 25 seconds using anyDuplicated.data.frame.

    # if you want to calculate based on just var1 and var2
    system.time(any(duplicated(dat, by=c("var1", "var2"))))
    #  user  system elapsed
    # 0.492   0.001   0.495
    

    By comparison, the same check takes 7.4 seconds using anyDuplicated.data.frame.
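
    For the yes/no question itself, data.table's anyDuplicated method and uniqueN take the same by= argument. A small sketch on a made-up table (datToy is a hypothetical name; this assumes the data.table package is installed):

    ```r
    library(data.table)

    # Toy table: (var1, var2) repeats in rows 1 and 3; adding var3 makes every row distinct.
    datToy <- data.table(
      var1 = c(1L, 2L, 1L),
      var2 = c("a", "a", "a"),
      var3 = c(0.5, 0.5, 1.5)
    )

    # anyDuplicated() returns 0 when no row repeats on the `by` columns.
    key_two   <- anyDuplicated(datToy, by = c("var1", "var2")) == 0L
    key_three <- anyDuplicated(datToy, by = c("var1", "var2", "var3")) == 0L

    # uniqueN() counts the distinct combinations of the `by` columns.
    n_distinct_two <- uniqueN(datToy, by = c("var1", "var2"))
    ```

    The columns form a unique index exactly when the anyDuplicated result is 0, or equivalently when uniqueN equals nrow of the table.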
