Can I use a list as a hash in R? If so, why is it so slow?

遇见更好的自我 2020-11-29 23:43

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the following pattern runs almost instantly in Perl: build a hash of 10,000 random three-letter keys mapped to random integer values, then perform 10,000 random lookups. My equivalent R code, using a named list as the hash, took far longer.

7 Answers
  • 2020-11-30 00:14

    You could try environments and/or the hash package by Christopher Brown (which happens to use environments under the hood).
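A minimal sketch of both options (the hash package calls shown here, hash(), [[, and keys(), are from that package's interface; assumes install.packages("hash") has been run):

```r
# Base R: an environment is a real hash table
env <- new.env(hash = TRUE)
assign("apple", 1, envir = env)
get("apple", envir = env)          # 1

# The hash package: a friendlier wrapper around an environment
library(hash)
h <- hash()
h[["apple"]]  <- 1
h[["banana"]] <- 2
h[["banana"]]                      # 2
sort(keys(h))                      # "apple" "banana"
```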

  • 2020-11-30 00:14

Your code is very un-R-like, and that is one of the reasons it's so slow. I haven't optimized the code below for maximum speed, only for R-ness.

    n <- 10000

    # Build n random three-letter keys and use them as names on a list of values
    keys <- matrix(sample(letters, 3 * n, replace = TRUE), nrow = 3)
    keys <- apply(keys, 2, paste0, collapse = '')
    value <- floor(1000 * runif(n))
    testHash <- as.list(value)
    names(testHash) <- keys

    # One vectorized subset does all n lookups at once
    keys <- sample(names(testHash), n, replace = TRUE)
    lookupValue <- testHash[keys]
    print(data.frame(key = keys, lookup = unlist(lookupValue)))


    On my machine that runs almost instantaneously excluding the printing. Your code ran about the same speed you reported. Is it doing what you want? You could set n to 10 and just look at the output and testHash and see if that's it.

    NOTE on syntax: the apply above is still just a loop, and loops are slow in R; the point of the apply family is expressiveness, not speed. Many of the commands that follow it could have been put inside that loop, and with a for loop that would be the temptation. In R, take as much out of your loops as possible. The apply family makes this more natural, because each call is designed to represent the application of one function over a list of some sort, rather than a generic loop (yes, I know apply can be used with more than one command).
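    To see the point concretely, here is a small, hypothetical timing comparison between an explicit element-by-element loop and a single vectorized subset of the same named list (the key format is made up for illustration):

    ```r
    n <- 10000
    vals <- as.list(runif(n))
    names(vals) <- sprintf("k%05d", seq_len(n))
    probe <- sample(names(vals), n, replace = TRUE)

    # Loop: each [[ does a fresh scan of the names
    t_loop <- system.time({
      out1 <- numeric(n)
      for (i in seq_len(n)) out1[i] <- vals[[probe[i]]]
    })["elapsed"]

    # Vectorized: one `[` call resolves all n keys at once
    t_vec <- system.time({
      out2 <- unlist(vals[probe], use.names = FALSE)
    })["elapsed"]

    stopifnot(identical(out1, out2))
    c(loop = unname(t_loop), vectorized = unname(t_vec))
    ```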

  • 2020-11-30 00:15

    First off, as Vince and Dirk have said, you are not using hashes in your example code. A literal translation of the Perl example would be

    #!/usr/bin/Rscript
    testHash <- new.env(hash = TRUE, size = 10000L)
    for(i in 1:10000) {
      key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
      assign(key, floor(1000*runif(1)), envir = testHash)
    }
    
    keyArray <- ls(envir = testHash)
    keyLen <- length(keyArray)
    
    for(j in 1:10000) {
      key <- keyArray[sample(keyLen, 1)]
      lookupValue <- get(key, envir = testHash)
      cat(paste("key", key, "Lookup", lookupValue, "\n"))
    }
    

    which runs plenty fast on my machine, the main cost being the setup. (Try it and post the timings.)

    But the real problem, as John said, is that you have to think in vectors in R (like map in Perl), and his solution is probably the best. If you do want to use hashes, consider

    keys <- sample(ls(envir = testHash), 10000, replace = TRUE)
    vals <- mget(keys, envir = testHash)
    

    after the same setup as above, which is near-instantaneous on my machine. To print them all try

    cat(paste(keys, vals), sep="\n")
    

    Hope this helps a little.

    Allan

  • 2020-11-30 00:20

    But an environment cannot contain another environment (quoted from Vince's answer).

    Maybe it was that way some time ago (I don't know), but this information no longer seems to be accurate:

    > d <- new.env()
    > d$x <- new.env()
    > d$x$y <- 20
    > d$x$y
    [1] 20
    

    So environments make a pretty capable map/dict now. If you miss the '[' operator, use the hash package in that case.
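    For instance, a short sketch of the slice behaviour the hash package adds (hash(), [ on multiple keys, and values() are from that package's interface; treat the exact output shapes as an assumption):

    ```r
    library(hash)
    h <- hash(letters[1:5], 1:5)   # keys "a".."e" mapped to 1..5
    h[["c"]]                       # single lookup
    sub <- h[c("a", "b")]          # '[' slices: a new hash with just those keys
    values(sub)                    # the sliced values as a named vector
    ```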

    This note taken from the hash package documentation may also be of interest:

    R is slowly moving toward a native implementation of hashes using environments (cf. Extract). Access to environments using $ and [[ has been available for some time, and recently objects can inherit from environments, etc. But many features that make hashes/dictionaries great are still lacking, such as the slice operation, [.

  • 2020-11-30 00:26

    The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1) because, on insert, the key is converted to an integer by a hash function, and the value is put at position hash(key) % num_spots of an array num_spots long (this is a big simplification that glosses over collision handling). Looking up a key then just requires hashing it again to find the value's position, which is O(1), versus an O(n) scan through the names. R's named lists use name lookups, which are O(n).
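    A quick way to see the O(1)-versus-O(n) difference is to repeatedly look up the last key in a large named list and in an equally large hashed environment (a sketch; exact timings vary by machine):

    ```r
    n <- 100000
    nm <- sprintf("k%06d", seq_len(n))

    # Named list: lookups scan the names linearly
    lst <- as.list(seq_len(n))
    names(lst) <- nm

    # Environment: lookups go through a real hash table
    env <- new.env(hash = TRUE, size = n)
    for (i in seq_len(n)) assign(nm[i], i, envir = env)

    last <- nm[n]  # worst case for a linear scan of the names
    system.time(for (r in 1:2000) lst[[last]])             # O(n) per lookup
    system.time(for (r in 1:2000) get(last, envir = env))  # O(1) per lookup
    ```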

    As Dirk says, use the hash package. A big limitation is that it uses environments (which are hashed) and overrides the [ methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash package.

    A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)

  • 2020-11-30 00:36

    If you are trying to hash 10,000,000+ things in R using the hash package, building the hash takes a very, very long time. It crashed R for me, even though the data took less than 1/3 of my memory.

    I had much better performance with the package data.table using setkey. If you are not familiar with data.table and setkey, you might start here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html

    I realize the original question referred to 10,000 things, but Google directed me here a couple of days ago. I tried to use the hash package and had a really hard time. Then I found this blog post, which suggests that building the hash can take hours for 10M+ things, and that aligns with my experience:
    https://appsilon.com/fast-data-lookups-in-r-dplyr-vs-data-table/
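    For completeness, a sketch of the data.table approach with setkey (the table and column names here are made up for illustration):

    ```r
    library(data.table)

    n <- 1e6
    dt <- data.table(key   = sprintf("k%07d", seq_len(n)),
                     value = runif(n))
    setkey(dt, key)   # sorts the table by 'key' and marks it as the key

    # Keyed subsets use binary search instead of a full vector scan
    dt[.("k0500000")]                          # the matching row
    dt[.(c("k0000001", "k0999999")), value]    # just the values, vectorized
    ```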
