Can I use a list as a hash in R? If so, why is it so slow?

遇见更好的自我 2020-11-29 23:43

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the following pattern runs almost instantly in Perl: build a hash of 10,000 random three-letter keys mapped to random integer values, then perform 10,000 random lookups. My equivalent R code, using a named list as the hash, took far longer.

7 Answers
  • 2020-11-30 00:14

    You could try environments and/or the hash package by Christopher Brown (which happens to use environments under the hood).
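A minimal sketch of both options (the hash package calls shown here, hash(), [[, and keys(), are from that package's interface; assumes install.packages("hash") has been run):

```r
# Base R: an environment is a real hash table
env <- new.env(hash = TRUE)
assign("apple", 1, envir = env)
get("apple", envir = env)          # 1

# The hash package: a friendlier wrapper around an environment
library(hash)
h <- hash()
h[["apple"]]  <- 1
h[["banana"]] <- 2
h[["banana"]]                      # 2
sort(keys(h))                      # "apple" "banana"
```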

  • 2020-11-30 00:14

Your code is very un-R-like, and that is one of the reasons it's so slow. I haven't optimized the code below for maximum speed, only for R-ness.

    n <- 10000

    # Build n random three-letter keys and use them as names on a list of values
    keys <- matrix(sample(letters, 3 * n, replace = TRUE), nrow = 3)
    keys <- apply(keys, 2, paste0, collapse = '')
    value <- floor(1000 * runif(n))
    testHash <- as.list(value)
    names(testHash) <- keys

    # One vectorized subset does all n lookups at once
    keys <- sample(names(testHash), n, replace = TRUE)
    lookupValue <- testHash[keys]
    print(data.frame(key = keys, lookup = unlist(lookupValue)))


    On my machine that runs almost instantaneously excluding the printing. Your code ran about the same speed you reported. Is it doing what you want? You could set n to 10 and just look at the output and testHash and see if that's it.

    NOTE on syntax: the apply above is still just a loop, and loops are slow in R; the point of the apply family is expressiveness, not speed. Many of the commands that follow it could have been put inside that loop, and with a for loop that would be the temptation. In R, take as much out of your loops as possible. The apply family makes this more natural, because each call is designed to represent the application of one function over a list of some sort, rather than a generic loop (yes, I know apply can be used with more than one command).
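    To see the point concretely, here is a small, hypothetical timing comparison between an explicit element-by-element loop and a single vectorized subset of the same named list (the key format is made up for illustration):

    ```r
    n <- 10000
    vals <- as.list(runif(n))
    names(vals) <- sprintf("k%05d", seq_len(n))
    probe <- sample(names(vals), n, replace = TRUE)

    # Loop: each [[ does a fresh scan of the names
    t_loop <- system.time({
      out1 <- numeric(n)
      for (i in seq_len(n)) out1[i] <- vals[[probe[i]]]
    })["elapsed"]

    # Vectorized: one `[` call resolves all n keys at once
    t_vec <- system.time({
      out2 <- unlist(vals[probe], use.names = FALSE)
    })["elapsed"]

    stopifnot(identical(out1, out2))
    c(loop = unname(t_loop), vectorized = unname(t_vec))
    ```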

  • 2020-11-30 00:15

    First off, as Vince and Dirk have said, you are not using hashes in your example code. A literal translation of the Perl example would be

    #!/usr/bin/Rscript
    testHash <- new.env(hash = TRUE, size = 10000L)
    for(i in 1:10000) {
      key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
      assign(key, floor(1000*runif(1)), envir = testHash)
    }
    
    keyArray <- ls(envir = testHash)
    keyLen <- length(keyArray)
    
    for(j in 1:10000) {
      key <- keyArray[sample(keyLen, 1)]
      lookupValue <- get(key, envir = testHash)
      cat(paste("key", key, "Lookup", lookupValue, "\n"))
    }
    

    which runs plenty fast on my machine, the main cost being the setup. (Try it and post the timings.)

    But the real problem, as John said, is that you have to think in vectors in R (like map in Perl), and his solution is probably the best. If you do want to use hashes, consider

    keys <- sample(ls(envir = testHash), 10000, replace = TRUE)
    vals <- mget(keys, envir = testHash)
    

    after the same setup as above, which is near-instantaneous on my machine. To print them all try

    cat(paste(keys, vals), sep="\n")
    

    Hope this helps a little.

    Allan

  • 2020-11-30 00:20

    But an environment cannot contain another environment (quoted from Vince's answer).

    Maybe it was that way some time ago (I don't know), but this information no longer seems to be accurate:

    > d <- new.env()
    > d$x <- new.env()
    > d$x$y <- 20
    > d$x$y
    [1] 20
    

    So environments make a pretty capable map/dict now. If you miss the '[' operator, use the hash package in that case.
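    For instance, a short sketch of the slice behaviour the hash package adds (hash(), [ on multiple keys, and values() are from that package's interface; treat the exact output shapes as an assumption):

    ```r
    library(hash)
    h <- hash(letters[1:5], 1:5)   # keys "a".."e" mapped to 1..5
    h[["c"]]                       # single lookup
    sub <- h[c("a", "b")]          # '[' slices: a new hash with just those keys
    values(sub)                    # the sliced values as a named vector
    ```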

    This note taken from the hash package documentation may also be of interest:

    R is slowly moving toward a native implementation of hashes using environments (cf. Extract). Access to environments using $ and [[ has been available for some time, and recently objects can inherit from environments, etc. But many features that make hashes/dictionaries great are still lacking, such as the slice operation, [.

  • 2020-11-30 00:26

    The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1) because, on insert, the key is converted to an integer by a hash function, and the value is put at position hash(key) % num_spots of an array num_spots long (this is a big simplification that glosses over collision handling). Looking up a key then just requires hashing it again to find the value's position, which is O(1), versus an O(n) scan through the names. R's named lists use name lookups, which are O(n).
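    A quick way to see the O(1)-versus-O(n) difference is to repeatedly look up the last key in a large named list and in an equally large hashed environment (a sketch; exact timings vary by machine):

    ```r
    n <- 100000
    nm <- sprintf("k%06d", seq_len(n))

    # Named list: lookups scan the names linearly
    lst <- as.list(seq_len(n))
    names(lst) <- nm

    # Environment: lookups go through a real hash table
    env <- new.env(hash = TRUE, size = n)
    for (i in seq_len(n)) assign(nm[i], i, envir = env)

    last <- nm[n]  # worst case for a linear scan of the names
    system.time(for (r in 1:2000) lst[[last]])             # O(n) per lookup
    system.time(for (r in 1:2000) get(last, envir = env))  # O(1) per lookup
    ```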

    As Dirk says, use the hash package. A big limitation is that it uses environments (which are hashed) and overrides the [ methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash package.

    A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)

  • 2020-11-30 00:36

    If you are trying to hash 10,000,000+ things in R using the hash package, building the hash takes a very, very long time. It crashed R for me, even though the data took less than 1/3 of my memory.

    I had much better performance with the package data.table using setkey. If you are not familiar with data.table and setkey, you might start here: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html

    I realize the original question referred to 10,000 things, but Google directed me here a couple of days ago. I tried to use the hash package and had a really hard time. Then I found this blog post, which suggests that building the hash can take hours for 10M+ things, and that aligns with my experience:
    https://appsilon.com/fast-data-lookups-in-r-dplyr-vs-data-table/
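    For completeness, a sketch of the data.table approach with setkey (the table and column names here are made up for illustration):

    ```r
    library(data.table)

    n <- 1e6
    dt <- data.table(key   = sprintf("k%07d", seq_len(n)),
                     value = runif(n))
    setkey(dt, key)   # sorts the table by 'key' and marks it as the key

    # Keyed subsets use binary search instead of a full vector scan
    dt[.("k0500000")]                          # the matching row
    dt[.(c("k0000001", "k0999999")), value]    # just the values, vectorized
    ```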
