I am looking for an efficient way to create unique, numeric IDs for some synthetic data I'm generating.
Right now, I simply have a function that emits and incremen
I like to use the proto package for small OO programming. Under the hood, it uses environments in a similar fashion to what Martin Morgan illustrated.
# this defines your class
library(proto)
Counter <- proto(idCounter = 0L)
Counter$emitID <- function(self = .) {
  id <- formatC(self$idCounter, width = 9, flag = 0, format = "d")
  self$idCounter <- self$idCounter + 1L
  return(id)
}
# This creates an instance (or you can use `Counter` directly as a singleton)
mycounter <- Counter$proto()
# use it:
mycounter$emitID()
# [1] "000000000"
mycounter$emitID()
# [1] "000000001"
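Since proto builds on environments, the same idea can be sketched with a bare environment and no extra package. This is only an illustration, not how proto is implemented internally; the names `counter_env` and `emit_id` are made up for this example.

```r
# Hypothetical names: counter_env and emit_id are for illustration only.
counter_env <- new.env()
counter_env$idCounter <- 0L

emit_id <- function(env = counter_env) {
  # Format the current value as a zero-padded, 9-character string
  id <- formatC(env$idCounter, width = 9, flag = "0", format = "d")
  # Environments have reference semantics, so this update persists
  env$idCounter <- env$idCounter + 1L
  id
}
```

Calling `emit_id()` repeatedly yields `"000000000"`, `"000000001"`, and so on; passing a different environment gives a separate counter.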
A non-global version of the counter uses lexical scope to encapsulate idCounter with the increment function
emitID <- local({
    idCounter <- -1L
    function(){
        idCounter <<- idCounter + 1L                     # increment
        formatC(idCounter, width=9, flag=0, format="d")  # format & return
    }
})
and then
> emitID()
[1] "000000000"
> emitID()
[1] "000000001"
> idCounter <- 123 ## global variable, not locally scoped idCounter
> emitID()
[1] "000000002"
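If you ever need to look at or reset the hidden counter, the state lives in the function's enclosing environment, which `environment()` returns. A minimal sketch, restating `emitID` so it runs on its own (the variable `env` is just an illustrative name):

```r
# Restated from above so the snippet is self-contained.
emitID <- local({
  idCounter <- -1L
  function() {
    idCounter <<- idCounter + 1L
    formatC(idCounter, width = 9, flag = "0", format = "d")
  }
})

emitID()                     # "000000000"

# The closure's enclosing environment holds idCounter
env <- environment(emitID)   # 'env' is an illustrative name
env$idCounter                # 0L after one call
env$idCounter <- 41L         # deliberately skip ahead
emitID()                     # "000000042"
```

Assigning to a global `idCounter` still has no effect; only the binding inside `environment(emitID)` is used.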
A fun alternative is to use a 'factory' pattern to create independent counters. Your question implies that you'll call this function a billion (hmm, not sure where I got that impression...) times, so maybe it makes sense to vectorize the call to formatC by creating a buffer of ids?
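idFactory <- function(buf_n=1000000) {
    curr <- 0L    # position of the last id handed out from the buffer
    last <- -1L   # largest numeric id covered by earlier buffers
    val <- NULL   # buffer of pre-formatted ids
    function() {
        if ((curr %% buf_n) == 0L) {
            # buffer is empty or used up: format the next buf_n ids in one call
            val <<- formatC(last + seq_len(buf_n), width=9, flag=0, format="d")
            last <<- last + buf_n
            curr <<- 0L
        }
        val[curr <<- curr + 1L]
    }
}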
emitID2 <- idFactory()
and then (emitID1 is an instance of the local variable version above).
> library(microbenchmark)
> microbenchmark(emitID1(), emitID2(), times=100000)
Unit: microseconds
expr min lq median uq max neval
emitID1() 66.363 70.614 72.310 73.603 13753.96 1e+05
emitID2() 2.240 2.982 4.138 4.676 49593.03 1e+05
> emitID1()
[1] "000100000"
> emitID2()
[1] "000100000"
(The proto solution is about 3x slower than emitID1, though speed is not everything.)
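To make the "independent counters" point concrete, here is a small sketch of the same buffering idea with a tiny buffer, so the refill is easy to see. `make_counter` and its internal names are illustrative, not part of the answer above.

```r
# Illustrative restatement of the buffered factory; names are hypothetical.
make_counter <- function(buf_n = 4L) {
  buf <- character(0)  # pre-formatted ids in the current buffer
  pos <- 0L            # how many ids of buf have been handed out
  nxt <- 0L            # first numeric id of the next buffer
  function() {
    if (pos == length(buf)) {
      # Buffer exhausted (or never filled): format buf_n ids in one call
      buf <<- formatC(nxt + seq_len(buf_n) - 1L,
                      width = 9, flag = "0", format = "d")
      nxt <<- nxt + buf_n
      pos <<- 0L
    }
    pos <<- pos + 1L
    buf[pos]
  }
}

a <- make_counter()
b <- make_counter()
ids_a <- vapply(1:6, function(i) a(), character(1))  # spans one refill
first_b <- b()  # b has its own state, so it starts from zero
```

Each call to `make_counter()` creates a fresh enclosing environment, so `a` and `b` never share state; `ids_a` runs from `"000000000"` to `"000000005"` while `first_b` is `"000000000"`.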