Finding unique vector elements in a list efficiently

心在旅途 2021-02-15 08:09

I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a func

2 answers
  • 2021-02-15 08:40

    You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:

    library(digest)  
    
    ## Some example data
    x <- 1:44
    y <- 2:10
    z <- rnorm(10)
    ll <- list(x,y,x,x,x,z,y)
    
    ll[!duplicated(sapply(ll, digest))]
    # [[1]]
    #  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
    # [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
    # 
    # [[2]]
    # [1]  2  3  4  5  6  7  8  9 10
    # 
    # [[3]]
    #  [1]  1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373  0.94088670
    #  [7] -0.20254574 -1.08275938 -0.32937153  0.49454570
    

    To see at a glance why this works, here's what the hashes look like:

    sapply(ll, digest)
    # [1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
    # [3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
    # [5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
    # [7] "fd61b0fff79f76586ad840c9c0f497d1"
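
    The same keep-the-first-occurrence idea can be sketched in base R without digest. The helper vec_key below is hypothetical (it is not part of any package) and builds a plain-text key instead of a hash, so it assumes the vectors are short and that distinct numeric values still differ when printed at R's default 15 significant digits:

    ```r
    # Hypothetical helper, for illustration only: one text key per vector.
    # The storage type is included so that 1:3 and c(1, 2, 3) get different keys.
    vec_key <- function(v) paste(typeof(v), paste(v, collapse = ","), sep = ":")

    ll <- list(1:3, c(2.5, 4), 1:3, c(2.5, 4), 1:3)

    keys <- vapply(ll, vec_key, character(1))
    keep <- !duplicated(keys)   # TRUE only for the first occurrence of each key
    result <- ll[keep]

    keep
    # [1]  TRUE  TRUE FALSE FALSE FALSE
    ```

    Equal vectors yield equal keys, so duplicated() flags every repeat after the first, which is the same thing that happens with the digest hashes.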
    
  • 2021-02-15 08:56

    As per @JoshuaUlrich and @thelatemail, ll[!duplicated(ll)] works just fine, and therefore so should unique(ll).
    I previously suggested a method using sapply, with the idea of not checking every element in the list. I deleted that answer because I think using unique makes more sense.
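
    A minimal sketch on a small, made-up list shows the two calls agreeing. It also illustrates one thing to watch for: as far as I know, list elements are compared in the manner of identical(), so an integer vector and a double vector holding the same values count as different elements.

    ```r
    # Small illustrative list (not the data from the question)
    ll <- list(1:3, c(2.5, 4), 1:3, c(2.5, 4))

    a <- ll[!duplicated(ll)]   # keep the first occurrence of each element
    b <- unique(ll)            # same result in one call
    identical(a, b)
    # [1] TRUE

    # Storage type matters: 1:3 is integer, c(1, 2, 3) is double
    n_mixed <- length(unique(list(1:3, c(1, 2, 3))))
    n_mixed
    # [1] 2
    ```

    If the vectors may mix integer and double storage, converting them first (for example with lapply(ll, as.numeric)) is one way to make them comparable.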

    Since efficiency is a goal, we should benchmark these.

    # Let's create some sample data
    xx <- lapply(rep(100,15), sample)
    ll <- sample(xx, 1000, replace = TRUE)  # sampling a list already returns a list
    ll
    

    Putting the candidates up against each other in a benchmark:

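    library(digest)  # provides digest(), used by fun2

    # Keep element i only if it matches none of the earlier elements (needs length(ll) >= 2)
    fun1 <- function(ll) {
      ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
    }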
    
    fun2 <- function(ll) {
      ll[!duplicated(sapply(ll, digest))]
    }
    
    fun3 <- function(ll)  {
      ll[!duplicated(ll)]
    }
    
    fun4 <- function(ll)  {
      unique(ll)
    }
    
    # Make sure all four functions return the same result
    all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)), 
        identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
    # [1] TRUE
    
    
    library(rbenchmark)
    
    benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
    
            test elapsed relative user.self sys.self
    3     unique   0.048    1.000     0.049    0.000
    2 duplicated   0.050    1.042     0.050    0.000
    1     digest   8.427  175.563     8.415    0.038
    # I left fun1 out because it runs extremely slowly when ll is large
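
    If rbenchmark is not installed, a rough comparison of the two base-R candidates can be sketched with system.time(). The seed and the 100 repetitions are arbitrary choices, and the timings depend on the machine, so none are shown here:

    ```r
    set.seed(1)  # arbitrary seed, only to make the sample data reproducible
    xx <- lapply(rep(100, 15), sample)      # 15 random permutations of 1:100
    ll <- sample(xx, 1000, replace = TRUE)  # 1000 draws, so plenty of duplicates

    # Time 100 repetitions of each approach; r1 and r2 keep the last result
    t_dup <- system.time(for (i in 1:100) r1 <- ll[!duplicated(ll)])
    t_unq <- system.time(for (i in 1:100) r2 <- unique(ll))

    # Elapsed seconds for each approach
    c(duplicated = t_dup[["elapsed"]], unique = t_unq[["elapsed"]])

    identical(r1, r2)  # both approaches return the same list
    ```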
    

    Fastest Option:

    unique(ll)
    