Why does R's duplicated perform better on sorted data?

暖寄归人 2021-02-07 03:35

While comparing the efficiency of two functions in an answer to Check if list contains another list in R, I stumbled upon an interesting result. Sorting greatly increases the ef

1 answer
  • 2021-02-07 03:53

    The major factor is the rate of CPU cache misses and, as size scales, more expensive page faults. duplicated checks for duplicates by looking each element up in a simple hash table. If the portion of the hash table being queried is already in the CPU cache, these lookups are much faster. For small vectors, the whole hash table fits in the cache, so the order of access is not significant, which is what you saw in your first benchmark.

    For larger vectors, only some blocks of the hash table fit in the cache at any given time. If duplicates are consecutive, the portion of the hash table needed for a lookup is already in the cache for the subsequent lookups. This is why, for larger vectors, performance improves as the number of duplicates grows. For extremely large vectors, the hash table may not even fit entirely in available physical memory and may be paged out to disk, making the difference even more noticeable.
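To make the lookup pattern concrete, here is a toy sketch of hash-based duplicate detection. It is an illustration only and not R's internal C code: the function name `toy_duplicated`, the linear probing scheme, and the use of `NA` for empty slots are all assumptions made for this example, and it only handles positive whole numbers without `NA`s.

```r
# Toy duplicate detection with an open-addressing hash table.
# Illustration only: names and probing scheme are assumptions, not R internals.
toy_duplicated <- function(x) {
  K <- ceiling(log2(length(x) * 2))   # table has 2^K slots, at least 2 * length(x)
  M <- 2^K
  slots <- rep(NA_real_, M)           # NA marks an empty slot
  out <- logical(length(x))
  for (i in seq_along(x)) {
    # multiplicative hash mapped to a 0-based slot index
    s <- floor(((3141592653 * x[i]) %% 2^32) / 2^(32 - K))
    repeat {
      cur <- slots[s + 1]
      if (is.na(cur)) {               # empty slot: first occurrence, store it
        slots[s + 1] <- x[i]
        break
      }
      if (cur == x[i]) {              # same value already stored: a duplicate
        out[i] <- TRUE
        break
      }
      s <- (s + 1) %% M               # collision: try the next slot
    }
  }
  out
}

x <- c(5, 3, 5, 8, 3, 3)
identical(toy_duplicated(x), duplicated(x))
```

Every element touches the slot its hash points to, so when equal values (or values with nearby hashes) arrive one after another, consecutive iterations hit the same region of `slots`.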

    To test this, let's use the original post's s2 vector and its sorted version, and also a version in which duplicates are adjacent but the vector is otherwise unsorted.

    # samples as in original post
    s2 <- sample(10^6, 10^7, replace = TRUE)
    s2_sort <- sort(s2)
    
    # in the same order as s2, but with duplicates brought together
    u2 <- unique(s2)
    t2 <- rle(s2_sort)
    s2_chunked <- rep(u2, times = t2$lengths[match(u2, t2$values)])
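As a sanity check on this construction, the sketch below repeats it on a much smaller sample (the sizes and the shortened names `s`, `u`, `r` are assumptions chosen to keep it quick) and confirms that the chunked vector holds the same values as the original, with each distinct value forming exactly one run.

```r
set.seed(1)
s <- sample(100, 1000, replace = TRUE)   # small stand-in for s2
s_sort <- sort(s)

# same construction as above, on the small sample
u <- unique(s)
r <- rle(s_sort)
s_chunked <- rep(u, times = r$lengths[match(u, r$values)])

# same values overall, only reordered
same_multiset <- identical(sort(s_chunked), s_sort)
# each distinct value appears as exactly one run
runs_ok <- length(rle(s_chunked)$values) == length(u)
# reordering does not change how many elements are flagged as duplicates
same_dups <- sum(duplicated(s_chunked)) == sum(duplicated(s))
```

All three flags should come out `TRUE`, so any timing difference between the vectors comes from order alone.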
    

    Let's also consider sorting by hash value. I'll approximate R's hash coding, but we are working with doubles here rather than unsigned longs, so we can't use bitwise operations.

    # in the order of hash value
    K <- ceiling(log2(length(s2)*2))
    M <- 2^K
    h <- ((3141592653 * s2) %% 2^32)/2^(32-K)
    ho <- order(h)
    s2_hashordered <- s2[ho]
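A brief check of why the double-based approximation is safe for these inputs: the largest product stays below 2^53, so it is exactly representable as a double. The helper `approx_slot` is hypothetical and simply wraps the formula above, with a `floor` added (my assumption) to get a whole-number slot index.

```r
# Hypothetical helper mirroring the formula above; not R's internal function.
approx_slot <- function(x, K) floor(((3141592653 * x) %% 2^32) / 2^(32 - K))

n <- 10^7                      # length of s2 in the post
max_key <- 10^6                # largest value sample(10^6, ...) can return
K <- ceiling(log2(n * 2))      # 25 for n = 10^7

# the largest product is below 2^53, so the multiplication is exact in doubles
exact <- 3141592653 * max_key < 2^53

# slot indices fall in [0, 2^K)
slots <- approx_slot(c(1, 2, max_key), K)
in_range <- all(slots >= 0 & slots < 2^K)
```

For keys much larger than these, the product could exceed 2^53 and the approximation would no longer be exact.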
    

    We expect performance to be similar for s2_sort and s2_chunked, and even better for s2_hashordered. In each of these cases we've attempted to minimize cache misses.

    library(microbenchmark)

    microbenchmark(
      duplicated(s2),
      duplicated(s2_sort),
      duplicated(s2_chunked),
      duplicated(s2_hashordered),
      times = 10)
    
    Unit: milliseconds
                           expr      min       lq     mean   median       uq      max neval cld
                 duplicated(s2) 664.5652 677.9340 690.0001 692.3104 703.8312 711.1538    10   c
            duplicated(s2_sort) 245.6511 251.3861 268.7433 276.2330 279.2518 284.6589    10  b 
         duplicated(s2_chunked) 240.0688 243.0151 255.3857 248.1327 276.3141 283.4298    10  b 
     duplicated(s2_hashordered) 166.8814 169.9423 185.9345 185.1822 202.7478 209.0383    10 a  
    