Can I use a list as a hash in R? If so, why is it so slow?

遇见更好的自我 2020-11-29 23:43

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the followi

7 Answers
  • 2020-11-30 00:38

    I'm a bit of an R hack, but I'm an empiricist, so I'll share some things I have observed and let those with a greater theoretical understanding of R shed light on the whys.

    • R seems much slower than Perl when using standard streams. Since stdin and stdout are much more commonly used in Perl, I assume it has optimizations around how it handles them. So in R I find it MUCH faster to read/write text using the built-in functions (e.g. write.table).

    • As others have said, vector operations in R are faster than loops... and w.r.t. speed, most apply() family syntax is simply a pretty wrapper on a loop.

    • Indexed things work faster than non-indexed. (Obvious, I know.) The data.table package supports indexing of data frame type objects.

    • I've never used hash environments like @Allen illustrated (and I've never inhaled hash... as far as you know)

    • Some of the syntax you used works, but could be tightened up. I don't think any of this really matters for speed, but the code's a little more readable. I don't write very tight code, but I edited a few things, like changing floor(1000*runif(1)) to sample(1:1000, n, replace=TRUE). (The two are not strictly identical: the first draws from 0 to 999, the second from 1 to 1000.) I don't mean to be pedantic, I just wrote it the way I would do it from scratch.
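
    As a rough illustration of the point about vector operations versus loops, here is a minimal sketch that builds the random three-letter keys both ways. The function names make_keys_loop and make_keys_vec are made up for this example and do not come from the question or the other answers; timings will vary by machine, so none are quoted.

    ```r
    # Sketch: build n random three-letter keys with and without a per-key loop.
    # make_keys_loop and make_keys_vec are illustrative names only.
    make_keys_loop <- function(n) {
      # sapply() calls the anonymous function once per key
      sapply(seq_len(n), function(i) {
        paste(sample(letters, 3, replace = TRUE), collapse = "")
      })
    }

    make_keys_vec <- function(n) {
      # A single sample() call draws all 3 * n letters at once;
      # each column of the 3 x n matrix becomes one key.
      m <- matrix(sample(letters, 3 * n, replace = TRUE), nrow = 3)
      paste0(m[1, ], m[2, ], m[3, ])
    }

    n <- 1e4
    system.time(k1 <- make_keys_loop(n))
    system.time(k2 <- make_keys_vec(n))
    ```

    Both functions return a character vector of length n; only the way the letters are drawn differs.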

    So with that in mind I decided to test the hash approach that @allen used (because it's novel to me) against my "poor man's hash" which I've created using an indexed data.table as a lookup table. I'm not 100% sure that what @allen and I are doing is exactly what you did in Perl because my Perl is pretty rusty. But I think the two methods below do the same thing. We both sample the second set of keys from the keys in the 'hash' as this prevents hash misses. You'd want to test how these examples handle hash dupes as I have not given that much thought.

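    library(data.table)

    dtTest <- function(n) {

      # Build n random three-letter keys and n random integer values
      makeDraw <- function(x) paste(sample(letters, 3, replace = TRUE), collapse = "")
      key <- sapply(1:n, makeDraw)
      value <- sample(1:1000, n, replace = TRUE)

      # A keyed data.table acts as the lookup table
      myDataTable <- data.table(key, value, key = "key")

      # Sample the lookup keys from the table itself so every lookup hits
      newKeys <- sample(as.character(myDataTable$key), n, replace = TRUE)

      # A character vector in i is joined against the key column
      lookupValues <- myDataTable[newKeys]

      strings <- paste("key", lookupValues$key, "Lookup", lookupValues$value)
      write.table(strings, file = "tmpout", quote = FALSE, row.names = FALSE, col.names = FALSE)
    }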
    

    # Environment-based hash, following @Allen's approach

    hashTest <- function(n) {

      testHash <- new.env(hash = TRUE, size = n)

      # Insert n random keys; a repeated key overwrites its earlier value
      for (i in 1:n) {
        key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
        assign(key, floor(1000 * runif(1)), envir = testHash)
      }

      # List the stored keys once and reuse the result
      keyArray <- ls(envir = testHash)

      keys <- sample(keyArray, n, replace = TRUE)
      vals <- mget(keys, envir = testHash)

      strings <- paste("key", keys, "Lookup", vals)
      write.table(strings, file = "tmpout", quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
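
    On the duplicate-key question raised earlier, here is a small base R sketch of what I would expect for the environment and named-list approaches. The tiny keys and values vectors are invented for illustration. I have left the keyed data.table case out; as far as I know a keyed join can return one row per matching row, so its result may be longer than the lookup vector, and that is worth checking separately.

    ```r
    # Sketch: how two base R containers treat a repeated key.
    keys   <- c("abc", "xyz", "abc")   # "abc" appears twice (made-up data)
    values <- c(1, 2, 3)

    # Environment: a later assign() overwrites the earlier value.
    h <- new.env(hash = TRUE)
    for (i in seq_along(keys)) assign(keys[i], values[i], envir = h)
    length(ls(envir = h))   # 2 distinct keys survive
    get("abc", envir = h)   # 3, the last value assigned

    # Named list: both entries are kept, but lookup by name returns the first match.
    l <- as.list(values)
    names(l) <- keys
    length(l)               # 3 entries
    l[["abc"]]              # 1, the first value stored under "abc"
    ```

    So the two structures can disagree about which value a duplicated key maps to, even though both lookups succeed.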
    

    If I run each method using 100,000 draws, I get something like this:

    > system.time(  dtTest(1e5))
       user  system elapsed 
      2.750   0.030   2.881 
    > system.time(hashTest(1e5))
       user  system elapsed 
      3.670   0.030   3.861 
    

    Keep in mind that this is still considerably slower than the Perl code which, on my PC, seems to run 100K samples in well under a second.

    I hope the above example helps. And if you have any questions as to why, maybe @allen, @vince, and @dirk will be able to answer ;)

    After I typed the above, I realized I had not tested what @john did. So, what the hell, let's do all 3. I changed the code from @john to use write.table() and here's his code:

    johnsCode <- function(n) {
      # n random three-letter keys and n random values in 0..999
      keys <- sapply(character(n), function(x) paste(letters[ceiling(26 * runif(3))],
        collapse = ""))
      value <- floor(1000 * runif(n))

      # A named list serves as the hash
      testHash <- as.list(value)
      names(testHash) <- keys

      # Sample the lookup keys from the stored names, then subset by name
      keys <- names(testHash)[ceiling(n * runif(n))]
      lookupValue <- testHash[keys]

      strings <- paste("key", keys, "Lookup", lookupValue)
      write.table(strings, file = "tmpout", quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
    

    and the run time:

    > system.time(johnsCode(1e5))
       user  system elapsed 
      2.440   0.040   2.544 
    

    And there you have it. @john writes tight/fast R code!
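
    One caveat on all the timings above: each function also generates its keys and writes a file, so the numbers do not isolate the lookup itself. Below is a minimal sketch of how one might time only the lookup step for the environment and named-list variants. It assumes distinct keys (via unique()) so the two containers agree, and the names h, l, and probe, plus the seed, are arbitrary choices for this example. I have not quoted timings since they depend on the machine.

    ```r
    # Sketch: time only the lookup, with keys and values built beforehand.
    set.seed(1)   # arbitrary seed, only for repeatability
    n <- 1e5
    m <- matrix(sample(letters, 3 * n, replace = TRUE), nrow = 3)
    keys   <- unique(paste0(m[1, ], m[2, ], m[3, ]))   # distinct keys only
    values <- sample(1:1000, length(keys), replace = TRUE)

    # Named list used as a hash
    l <- setNames(as.list(values), keys)

    # Environment-based hash filled from the same list
    h <- list2env(l, envir = new.env(hash = TRUE))

    # Lookup keys sampled from the stored keys, so there are no misses
    probe <- sample(keys, n, replace = TRUE)

    system.time(from_env  <- mget(probe, envir = h))
    system.time(from_list <- l[probe])

    # Both containers should hand back the same values for the same probes
    identical(unname(unlist(from_env)), unname(unlist(from_list)))
    ```

    If the lookup times turn out small next to the full function timings, most of the cost is in key generation and output rather than in the hash itself.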
