Python's xrange alternative for R, or how to loop over a large dataset lazily?

The following example is based on a discussion about using expand.grid with large data. As you can see, it ends up with an error. I guess this is due to possible combinat

2 Answers

臣服心动

2020-11-28 15:26

Another approach that seems to work:

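exp_gr = function(..., index)
{
    args = list(...)
    ns = lengths(args)             # number of values in each argument
    offs = cumprod(c(1L, ns))      # strides: 1, n1, n1*n2, ...
    n = offs[length(offs)]         # total number of combinations

    stopifnot(index >= 1, index <= n)

    # zero-based position within each argument for the requested row
    i = (index[[1L]] - 1L) %% offs[-1L] %/% offs[-length(offs)]

    return(do.call(data.frame, 
           setNames(Map("[[", args, i + 1L), 
                    paste("Var", seq_along(args), sep = ""))))
}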

In the above function, ... are the arguments to expand.grid and index is the row number of the wanted combination, counted in the order in which expand.grid would generate the rows. E.g.:

expand.grid(1:3, 10:12, 21:24, letters[2:5])[c(5, 22, 24, 35, 51, 120, 144), ]
#    Var1 Var2 Var3 Var4
#5      2   11   21    b
#22     1   11   23    b
#24     3   11   23    b
#35     2   12   24    b
#51     3   11   22    c
#120    3   10   22    e
#144    3   12   24    e
do.call(rbind, lapply(c(5, 22, 24, 35, 51, 120, 144), 
                      function(i) exp_gr(1:3, 10:12, 21:24, letters[2:5], index = i)))
#  Var1 Var2 Var3 Var4
#1    2   11   21    b
#2    1   11   23    b
#3    3   11   23    b
#4    2   12   24    b
#5    3   11   22    c
#6    3   10   22    e
#7    3   12   24    e
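
To see why the index arithmetic works, it may help to trace one row by hand. The sketch below decomposes index 22 from the example above; the variable names (sizes, rem, pos0, pos) are only illustrative and do not appear in the function itself.

```r
sizes <- c(3, 3, 4, 4)            # lengths of 1:3, 10:12, 21:24, letters[2:5]
offs  <- cumprod(c(1, sizes))     # 1 3 9 36 144
index <- 22

# Remainder of the zero-based index with respect to each cumulative size
rem  <- (index - 1) %% offs[-1]            # 0 3 21 21
# Dividing by the stride of each argument gives its zero-based position
pos0 <- rem %/% offs[-length(offs)]        # 0 1 2 0
pos  <- pos0 + 1                           # 1 2 3 1
```

Positions 1, 2, 3, 1 select 1, 11, 23 and "b", which is row 22 of the expand.grid output shown above.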

And on large structures:

expand.grid(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2)
#Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
#  invalid 'times' value
#In addition: Warning message:
#In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
#  NAs introduced by coercion to integer range
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1    1    1    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e3 + 487)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   87   15    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e2 ^ 6)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1  100  100  100  100  100  100
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e11 + 154)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   54    2    1    1    1   11
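
The expand.grid failure above is a matter of size: the full grid would have more rows than R's integer range allows, which is what the coercion warning points at. A quick check, assuming 4-byte integer columns for the rough memory estimate:

```r
sizes  <- rep(100, 6)
n_rows <- prod(sizes)                    # 1e12 combinations
too_big <- n_rows > .Machine$integer.max # TRUE: exceeds 2147483647

# Rough lower bound on storage: 6 integer columns at 4 bytes per value
bytes <- n_rows * length(sizes) * 4      # 2.4e13 bytes
```

Even if the row count were representable, the data frame would need tens of terabytes, so computing rows on demand is the only practical option here.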

A similar approach would be to construct a "class" that stores the ... arguments that would otherwise be passed to expand.grid, and to define a [ method that computes the appropriate combination from its index when needed. Using %% and %/% seems valid, though I guess iterating with these operators will be slower than it needs to be.
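
A minimal sketch of that "class" idea, under stated assumptions: the S3 class name lazy_grid and its fields args and offs are hypothetical, chosen only for illustration. The [ method accepts a vector of row numbers and applies %% and %/% to the whole vector at once, which avoids one function call per row.

```r
# Hypothetical S3 class; `lazy_grid`, `args` and `offs` are illustrative names.
lazy_grid <- function(...) {
  args <- list(...)
  # Only the fully unnamed case is handled, mirroring expand.grid's Var1, Var2, ...
  if (is.null(names(args))) names(args) <- paste0("Var", seq_along(args))
  structure(list(args = args, offs = cumprod(c(1, lengths(args)))),
            class = "lazy_grid")
}

# `[` computes only the requested rows; i may be a vector of row numbers.
`[.lazy_grid` <- function(x, i, ...) {
  offs <- x$offs
  stopifnot(all(i >= 1), all(i <= offs[length(offs)]))
  cols <- lapply(seq_along(x$args), function(j) {
    pos <- (i - 1) %% offs[j + 1] %/% offs[j] + 1
    x$args[[j]][pos]
  })
  names(cols) <- names(x$args)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

g <- lazy_grid(1:100, 1:100, 1:100, 1:100, 1:100, 1:100)
res <- g[c(1, 1e3 + 487, 1e11 + 154)]
```

Here res should hold the same three combinations as the individual exp_gr calls above, one per row.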

夕颜

2020-11-28 15:38

One (arguably more "proper") way to approach this would be to write your own iterator for the iterators package that @BenBolker suggested (there is a PDF on writing extensions). Lacking something more formal, here is a poor-man's iterator, similar to expand.grid but manually advanced. (Note: this will suffice as long as the computation on each iteration is "more expensive" than this function itself. It could certainly be improved, but "it works".)

This function returns a function; each time the returned function is called, it yields a named list (named after the provided factors). It is lazy in that it does not expand the entire list of possible combinations; it is not lazy with the arguments themselves, which are evaluated ('consumed') immediately.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # Start one step before the first combination; the first call advances to it.
  indices <- c(0, rep(1, length(dots)-1))
  function() {
    indices[1] <<- indices[1] + 1
    # Carry overflowing positions into the next factor, odometer-style.
    while (any(rolls <- (indices > sizes))) {
      # If the last factor overflows, the grid is exhausted.
      if (tail(rolls, n=1)) return(FALSE)
      indices[rolls] <<- 1
      indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}

Sample usage:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21

## <yawn>

nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE

NB: for brevity of display, I used as.data.frame(mapply(...)) for the example; it works either way, but if a named list works fine for you then the conversion to a data.frame isn't necessary.

EDIT

Based on alexis_laz's answer, here's a much-improved version that is (a) much faster and (b) allows arbitrary seeking.

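lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  # Strides 1, n1, n1*n2, ...; the last entry is the total number of combinations.
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0  # internal counter: index of the most recently requested combination
  function(index) {
    # No argument: advance the counter; otherwise seek to the given index.
    i <<- if (missing(index)) (i + 1L) else index
    # Vector of indices: call this same function on each and bind the rows.
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    # Out of range: signal that there is nothing to return.
    if (i > maxcount || i < 1L) return(FALSE)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L  ),
             argnames)
  }
}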

It works with no arguments (auto-increment the internal counter), one argument (seek and set the internal counter), or a vector argument (seek to each and set the counter to the last, returns a data.frame).

This last use-case allows for sampling a subset of the design space:

set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94

Thanks alexis_laz for the improvements of cumprod, Map, and index calculations!
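
For the original question of looping lazily, the no-argument mode can drive a plain loop. The sketch below is a compact restatement written so that it runs on its own; grid_iter, step and combo are illustrative names, and the vector-seeking branch is left out for brevity. It assumes FALSE is the end-of-grid signal, as in the functions above.

```r
# Compact stand-in for the iterator above; names here are hypothetical.
grid_iter <- function(...) {
  dots <- list(...)
  strides <- cumprod(c(1, lengths(dots)))
  total <- strides[length(strides)]
  i <- 0
  step <- function(index) {
    i <<- if (missing(index)) i + 1 else index
    if (i > total || i < 1) return(FALSE)
    pos <- (i - 1) %% strides[-1] %/% strides[-length(strides)] + 1
    Map(`[[`, dots, pos)   # keeps the names of dots
  }
  step
}

nxt <- grid_iter(a = 1:3, b = 15:16, c = 21:22)
count <- 0
sum_a <- 0
repeat {
  combo <- nxt()
  # A combination is a list, so compare against FALSE with identical().
  if (identical(combo, FALSE)) break
  count <- count + 1
  sum_a <- sum_a + combo$a
}
count   # 12 combinations visited
sum_a   # 24: each of 1, 2, 3 appears four times
```

Only one combination exists in memory at a time, so the same loop shape should carry over to grids far too large for expand.grid, provided the per-iteration work dominates the cost of the index arithmetic.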
