Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.
I'm a bit of an R hack, but I'm an empiricist, so I'll share some things I have observed and let those with greater theoretical understanding of R shed light on the whys.
R seems much slower using standard streams than Perl. Since stdin and stdout are much more commonly used in Perl, I assume it has optimizations around how it does these things. So in R I find it MUCH faster to read/write text using the built-in functions (e.g. write.table).
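To make that concrete, here's a rough sketch of the same idea (the tmpout2 file name is just made up for this illustration): one write.table call versus pushing lines out one at a time through a connection.

x <- paste("line", 1:1e5)

# one shot with the built-in writer
system.time(write.table(x, file="tmpout", quote=F, row.names=F, col.names=F))

# stream-style alternative: write each line individually
# (tmpout2 is just a throwaway name for this sketch)
system.time({
  con <- file("tmpout2", open = "w")
  for (s in x) cat(s, "\n", file = con, sep = "")
  close(con)
})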
As others have said, vector operations in R are faster than loops... and w.r.t. speed, most apply() family syntax is simply a pretty wrapper on a loop.
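As a quick illustration of the vector-vs-loop point (the numbers here are arbitrary), something like this shows the gap:

x <- runif(1e6)

# vectorized: one call does all the work
system.time(y1 <- x * 2)

# explicit loop over every element
system.time({
  y2 <- numeric(length(x))
  for (i in seq_along(x)) y2[i] <- x[i] * 2
})

identical(y1, y2)   # same answer, very different run times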
Indexed things work faster than non-indexed. (Obvious, I know.) The data.table package supports indexing of data frame type objects.
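A minimal sketch of what I mean by indexing (toy data, nothing to do with the benchmark below): key the table once, then look rows up by value instead of scanning.

library(data.table)
dt <- data.table(key = c("abc", "def", "ghi"), value = c(10, 20, 30))  # toy data
setkey(dt, key)   # index on the 'key' column
dt["def"]         # keyed lookup, roughly the "poor man's hash" idea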
I've never used hash environments like @Allen illustrated (and I've never inhaled hash... as far as you know)
Some of the syntax you used works, but could be tightened up. I don't think any of this really matters for speed, but the code's a little more readable. I don't write very tight code, but I edited a few things, like changing floor(1000*runif(1)) to sample(1:1000, n, replace=T). I don't mean to be pedantic; I just wrote it the way I would do it from scratch.
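For what it's worth, the two draws look like this side by side (the ranges differ by one, 0-999 versus 1-1000, which doesn't matter here):

n <- 5
floor(1000*runif(n))            # integers in 0..999 via a uniform draw
sample(1:1000, n, replace=T)    # integers in 1..1000, stated directly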
So with that in mind I decided to test the hash approach that @allen used (because it's novel to me) against my "poor man's hash" which I've created using an indexed data.table as a lookup table. I'm not 100% sure that what @allen and I are doing is exactly what you did in Perl because my Perl is pretty rusty. But I think the two methods below do the same thing. We both sample the second set of keys from the keys in the 'hash' as this prevents hash misses. You'd want to test how these examples handle hash dupes as I have not given that much thought.
require(data.table)
dtTest <- function(n) {
  # build n random 3-letter keys and integer values, keyed on 'key'
  makeDraw <- function(x) paste(sample(letters, 3, replace=T), collapse="")
  key   <- sapply(1:n, makeDraw)
  value <- sample(1:1000, n, replace=T)
  myDataTable <- data.table(key, value, key='key')

  # draw n lookup keys from the existing keys, then do a keyed join
  newKeys <- sample(as.character(myDataTable$key), n, replace = TRUE)
  lookupValues <- myDataTable[newKeys]

  strings <- paste("key", lookupValues$key, "Lookup", lookupValues$value)
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F)
}
#
hashTest <- function(n) {
  # fill a hashed environment with n random 3-letter keys
  testHash <- new.env(hash = TRUE, size = n)
  for(i in 1:n) {
    key <- paste(sample(letters, 3, replace = TRUE), collapse = "")
    assign(key, floor(1000*runif(1)), envir = testHash)
  }

  # draw n lookup keys from the environment and fetch them all at once
  keyArray <- ls(envir = testHash)
  keyLen <- length(keyArray)
  keys <- sample(keyArray, n, replace = TRUE)
  vals <- mget(keys, envir = testHash)

  strings <- paste("key", keys, "Lookup", vals)
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F)
}
If I run each method using 100,000 draws, I get something like this:
> system.time( dtTest(1e5))
user system elapsed
2.750 0.030 2.881
> system.time(hashTest(1e5))
user system elapsed
3.670 0.030 3.861
Keep in mind that this is still considerably slower than the Perl code, which, on my PC, seems to run 100K samples in well under a second.
I hope the above example helps. And if you have any questions as to why, maybe @allen, @vince, and @dirk will be able to answer ;)
After I typed the above, I realized I had not tested what @john did. So, what the hell, let's do all 3. I changed the code from @john to use write.table() and here's his code:
johnsCode <- function(n){
  # build n random 3-letter keys and store the values in a named list
  keys <- sapply(character(n), function(x) paste(letters[ceiling(26*runif(3))],
                                                 collapse=''))
  value <- floor(1000*runif(n))
  testHash <- as.list(value)
  names(testHash) <- keys

  # draw n lookup keys from the names and index the list by name
  keys <- names(testHash)[ceiling(n*runif(n))]
  lookupValue <- testHash[keys]

  strings <- paste("key", keys, "Lookup", lookupValue)
  write.table(strings, file="tmpout", quote=F, row.names=F, col.names=F)
}
and the run time:
> system.time(johnsCode(1e5))
user system elapsed
2.440 0.040 2.544
And there you have it. @john writes tight/fast R code!