The following example is based on a discussion about using expand.grid
with large data. As you can see, it ends with an error. I guess this is due to the possible combinations.
Another approach that, as far as I can tell, looks valid:
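exp_gr = function(..., index)
{
    args = list(...)
    ns = lengths(args)
    # offs[j]: number of combinations spanned by the arguments before the j-th
    offs = cumprod(c(1L, ns))
    n = offs[length(offs)]    # total number of combinations
    stopifnot(index >= 1L, index <= n)
    # zero-based position of the requested combination within each argument
    i = (index[[1L]] - 1L) %% offs[-1L] %/% offs[-length(offs)]
    return(do.call(data.frame,
                   setNames(Map("[[", args, i + 1L),
                            paste("Var", seq_along(args), sep = ""))))
}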
In the above function, ... are the arguments to expand.grid and index is the (1-based) row number of the combination to return.
E.g.:
expand.grid(1:3, 10:12, 21:24, letters[2:5])[c(5, 22, 24, 35, 51, 120, 144), ]
#    Var1 Var2 Var3 Var4
#5      2   11   21    b
#22     1   11   23    b
#24     3   11   23    b
#35     2   12   24    b
#51     3   11   22    c
#120    3   10   22    e
#144    3   12   24    e
do.call(rbind, lapply(c(5, 22, 24, 35, 51, 120, 144),
                      function(i) exp_gr(1:3, 10:12, 21:24, letters[2:5], index = i)))
#  Var1 Var2 Var3 Var4
#1    2   11   21    b
#2    1   11   23    b
#3    3   11   23    b
#4    2   12   24    b
#5    3   11   22    c
#6    3   10   22    e
#7    3   12   24    e
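To make the index arithmetic less opaque, here is a step-by-step sketch for index 22 of the small example above. The variable names (sizes, offs, digits) are only for illustration; the computation is meant to mirror what exp_gr does internally.

```r
# Lengths of the four arguments 1:3, 10:12, 21:24, letters[2:5]
sizes <- c(3, 3, 4, 4)

# offs[j] = number of combinations spanned by the arguments before the j-th
offs <- cumprod(c(1, sizes))          # 1 3 9 36 144

index <- 22
# Zero-based position inside each argument (a mixed-radix "digit")
digits <- (index - 1) %% offs[-1] %/% offs[-length(offs)]
digits                                # 0 1 2 0

# One-based positions: 1st of 1:3, 2nd of 10:12, 3rd of 21:24, 1st of letters[2:5]
digits + 1                            # 1 2 3 1
```

This agrees with row 22 in the output above (1, 11, 23, b).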
And on large structures:
expand.grid(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2)
#Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
# invalid 'times' value
#In addition: Warning message:
#In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
# NAs introduced by coercion to integer range
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1    1    1    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e3 + 487)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   87   15    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e2 ^ 6)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1  100  100  100  100  100  100
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e11 + 154)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   54    2    1    1    1   11
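As a rough sanity check on why the full grid cannot be built, the row count can be compared with R's integer range. This is only back-of-the-envelope arithmetic, consistent with the "coercion to integer range" warning above; it is not a statement about expand.grid internals.

```r
sizes <- rep(100, 6)                  # six arguments of length 100 each
n_rows <- prod(sizes)                 # 1e12 combinations (held as a double)

n_rows > .Machine$integer.max         # TRUE: more rows than an integer can count

# Rough lower bound on memory, assuming 4 bytes per integer cell and six columns
n_rows * length(sizes) * 4 / 1e12     # 24 (terabytes, decimal)
```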
A similar approach would be to construct a "class" that stores the ... arguments that expand.grid would be called on, and to define a [ method that calculates the appropriate combination when an index is requested. Using %% and %/% seems valid, though I guess iterating with these operators will be slower than it needs to be.
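A minimal sketch of that "class" idea, using S3. The names lazy_grid and nrow_lazy are hypothetical and chosen only for this illustration, and the [ method handles row indices only (no column argument).

```r
# Hypothetical constructor: stores the arguments, never the expanded grid
lazy_grid <- function(...) {
  args <- list(...)
  nms <- names(args)
  if (is.null(nms)) nms <- rep("", length(args))
  unnamed <- nms == ""
  nms[unnamed] <- paste0("Var", which(unnamed))
  structure(list(args = setNames(args, nms),
                 offs = cumprod(c(1, unname(lengths(args))))),
            class = "lazy_grid")
}

# Number of rows the full grid would have
nrow_lazy <- function(x) x$offs[length(x$offs)]

# `[` method: compute only the requested rows
`[.lazy_grid` <- function(x, i) {
  stopifnot(all(i >= 1), all(i <= nrow_lazy(x)))
  cols <- lapply(seq_along(x$args), function(j) {
    # position of each requested row within the j-th argument
    x$args[[j]][(i - 1) %% x$offs[j + 1] %/% x$offs[j] + 1]
  })
  as.data.frame(setNames(cols, names(x$args)), stringsAsFactors = FALSE)
}

g <- lazy_grid(1:3, 10:12, 21:24, letters[2:5])
g[c(5, 22, 144)]
#   Var1 Var2 Var3 Var4
# 1    2   11   21    b
# 2    1   11   23    b
# 3    3   12   24    e
```

Note that the row names here are simply 1, 2, 3 rather than the original row numbers that subsetting a real expand.grid result would keep.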
One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package
that @BenBolker suggested (pdf on writing extensions is here). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid
but advanced manually. (Note: this will suffice as long as the computation on each iteration is more expensive than this function itself. It could certainly be improved, but "it works".)
This function returns a named list (named after the provided factors) each time the returned function is called. It is lazy in that it does not expand the entire list of possible combinations; it is not lazy with the arguments themselves, which are evaluated ('consumed') immediately.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # start one step before the first combination
  indices <- c(0, rep(1, length(dots)-1))
  function() {
    indices[1] <<- indices[1] + 1
    # carry overflowing positions into the next factor, odometer-style
    while (any(rolls <- (indices > sizes))) {
      if (tail(rolls, n=1)) return(FALSE)   # last factor overflowed: exhausted
      indices[rolls] <<- 1
      indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}
Sample usage:
nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21
## <yawn>
nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE
NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.
EDIT
Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.
lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  # indices[j]: number of combinations spanned by the factors before the j-th
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    # no argument: advance the internal counter; otherwise seek to 'index'
    i <<- if (missing(index)) (i + 1L) else index
    # vector of indices: look up each one and row-bind the results
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L ),
             argnames)
  }
}
It works with no arguments (auto-increments the internal counter), one argument (seeks to it and sets the internal counter), or a vector argument (seeks to each, sets the counter to the last, and returns a data.frame).
This last use-case allows for sampling a subset of the design space:
set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94
Thanks alexis_laz for the improvements of cumprod, Map, and index calculations!
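Since nxt() returns FALSE once the grid is exhausted, either version can drive a plain loop. A minimal sketch, using a compact stand-in (grid_iter, a hypothetical name) with the same index arithmetic so that the block runs on its own; the loop body is just a placeholder for the real per-combination work.

```r
# Compact stand-in for lazyExpandGrid: auto-increment only, no seeking
grid_iter <- function(...) {
  dots <- list(...)
  offs <- cumprod(c(1, unname(lengths(dots))))
  n <- offs[length(offs)]
  i <- 0
  function() {
    if (i >= n) return(FALSE)          # exhausted
    i <<- i + 1
    Map(`[[`, dots, (i - 1) %% offs[-1] %/% offs[-length(offs)] + 1)
  }
}

nxt <- grid_iter(a = 1:3, b = 15:16, c = 21:22)
total <- 0
count <- 0
# isFALSE() needs R >= 3.5.0; identical(combo, FALSE) works on older versions
while (!isFALSE(combo <- nxt())) {
  total <- total + combo$a * combo$b   # placeholder computation
  count <- count + 1
}
count   # 12
```

Only one combination is held in memory at a time, which is the point of the lazy approach.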