Conditional cumsum with reset

Submitted by 我是研究僧i on 2019-11-28 06:59:06
Rentrop

I don't think this is easily vectorizable; at least I do not know how.

You can do it by hand via:

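my_cumsum <- function(x){
  grp = integer(length(x))   # group id for each element
  if(length(x) == 0) return(data.frame(grp, x))
  grp[1] = 1
  # seq_along(x)[-1] is empty for a single value, unlike 2:length(x)
  for(i in seq_along(x)[-1]){
    if(x[i-1] + x[i] <= 10){
      # still within the limit: stay in the group and carry the running sum
      grp[i] = grp[i-1]
      x[i] = x[i-1] + x[i]
    } else {
      # limit exceeded: start a new group; x[i] restarts the sum
      grp[i] = grp[i-1] + 1
    }
  }
  data.frame(grp, x)
}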

For your data this gives:

> my_cumsum(df$value)
   grp  x
1    1  4
2    1  9
3    2  7
4    2 10
5    3  8
6    3  9
7    4  2
8    4  7
9    4 10
10   5  6
11   5  8
12   6  6
13   6  9
14   6 10
15   7  4

Also for my "counter-example" this gives:

> my_cumsum(c(10,6,4))
  grp  x
1   1 10
2   2  6
3   2 10

As @Khashaa pointed out, this can be implemented more efficiently via Rcpp. He linked to the answer "How to speed up or vectorize a for loop?", which I find very useful.
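
For readers who want to stay in base R without writing the loop by hand, the same running sum can be sketched as a fold with Reduce(accumulate = TRUE). This is only a more compact way to write it and is not expected to be faster than the loop. The name reset_cumsum and the threshold argument are hypothetical and not part of the original answer.

```r
# Hypothetical helper: running sum that restarts once the threshold would be exceeded.
reset_cumsum <- function(x, threshold = 10) {
  if (length(x) == 0) return(data.frame(grp = integer(0), x = numeric(0)))
  # Fold over x: keep adding while the total stays <= threshold, else restart at v.
  running <- Reduce(function(acc, v) if (acc + v <= threshold) acc + v else v,
                    x, accumulate = TRUE)
  # A new group begins wherever the previous running sum plus the current value overflowed.
  starts <- c(TRUE, head(running, -1) + x[-1] > threshold)
  data.frame(grp = cumsum(starts), x = running)
}

reset_cumsum(c(10, 6, 4))
```

On the counter-example c(10, 6, 4) this should give the same groups (1, 2, 2) and running sums (10, 6, 10) as my_cumsum.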

The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:

1) vec: A vector of values that we want to group.

2) i: The index of the starting position in vec.

3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.

# Group a vector based on consecutive values with a cumulative sum <= 10
gf = function(vec, i, glv) {

  ## Break out of the recursion when we get to the last group
  if (sum(vec[i:length(vec)]) <= 10) {
    glv = c(glv, length(i:length(vec)))
    return(glv)
  }

  ## Keep recursion going if there are at least two groups left
  # Calculate length of current group
  gl = sum(cumsum(vec[i:length(vec)]) <= 10)

  # Append to previous group lengths
  glv.append = c(glv, gl)

  # Call function recursively 
  gf(vec, i + gl, glv.append)
}

Run the function to return a vector of group lengths:

group_vec = gf(df$value, 1, numeric(0))
[1] 2 2 2 3 2 3 1

To add a column of group numbers to df, repeat each group's index by its length with rep:

df$group10 = rep(1:length(group_vec), group_vec)

In its current form the function will only work on vectors that don't have any values greater than 10, and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
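
One possible generalization is sketched below. It assumes non-negative values, takes the threshold as an argument, and puts any single value larger than the threshold into a group of its own so the recursion always advances. The name gf_general and the default arguments are hypothetical and not part of the original answer.

```r
# Hypothetical generalization of gf(): returns a vector of group lengths.
# Assumes non-negative values, so the cumulative sum never decreases.
gf_general <- function(vec, threshold, i = 1, glv = integer(0)) {
  n <- length(vec)
  # Stop once every element has been assigned to a group
  if (i > n) return(glv)

  # Number of leading values whose cumulative sum stays within the threshold;
  # max(1L, ...) gives an oversized value its own group instead of looping forever
  gl <- max(1L, sum(cumsum(vec[i:n]) <= threshold))

  # Recurse on the remainder of the vector
  gf_general(vec, threshold, i + gl, c(glv, gl))
}

gf_general(c(12, 3, 4), 10)   # the 12 forms a group by itself
```

Because it is still recursive, very long vectors can run into R's nesting limit, so this is better suited to illustrating the idea than to production use.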

The function can be sped up somewhat by computing cumulative sums that look ahead only a certain number of values, rather than over the remaining length of the vector. For example, if the values are always positive integers, you only need to look ten values ahead, since no more than ten such numbers can fit in a group with a sum of at most 10. This too can be generalized for any target value. Even with this modification, the function is still slower than a loop for a vector with more than about a hundred values.
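
The look-ahead idea might be sketched as follows, under the assumption that every value is an integer of at least 1, so a group can never hold more than threshold values. The name gf_window is hypothetical, and the oversized-value guard (max(1L, ...)) is an addition that is not in the original function.

```r
# Hypothetical look-ahead variant: returns a vector of group lengths.
# Assumes all values are integers >= 1.
gf_window <- function(vec, threshold = 10, i = 1, glv = integer(0)) {
  n <- length(vec)
  if (i > n) return(glv)

  # Only the next `threshold` values can possibly belong to the current group
  window <- vec[i:min(n, i + threshold - 1)]

  # Length of the current group (at least 1, so the recursion always advances)
  gl <- max(1L, sum(cumsum(window) <= threshold))

  gf_window(vec, threshold, i + gl, c(glv, gl))
}

gf_window(rep(1, 25))   # 25 ones split into groups of at most ten
```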

I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially with regard to execution speed.

You could define your own function and then use it inside dplyr's mutate statement as follows:

library(dplyr)

df %>% group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>% 
  ungroup()

The cumsum_with_reset() function takes a column and a threshold value which resets the sum. cumsum_with_reset_group() is similar but identifies rows that have been grouped together. Definitions are as follows:

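# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
  running <- 0                   # running sum within the current group
  group <- 1
  result <- numeric(length(x))   # preallocate instead of growing with c()

  for (i in seq_along(x)) {      # seq_along() is safe for empty input
    running <- running + x[i]

    if (running > threshold) {
      # threshold exceeded: start a new group at the current value
      group <- group + 1
      running <- x[i]
    }

    result[i] <- group
  }

  result
}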

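# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
  running <- 0                   # running sum within the current group
  result <- numeric(length(x))   # preallocate instead of growing with c()

  for (i in seq_along(x)) {      # seq_along() is safe for empty input
    running <- running + x[i]

    if (running > threshold) {
      # threshold exceeded: restart the sum at the current value
      running <- x[i]
    }

    result[i] <- running
  }

  result
}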

We can take advantage of the function cumsumbinning, from the package MESS, that performs this task:

library(dplyr)
library(MESS)
df %>%
  group_by(group_10 = cumsumbinning(value, 10)) %>%
  mutate(cumsum_10 = cumsum(value)) 

Output

# A tibble: 15 x 5
# Groups:   group_10 [7]
      id order value group_10 cumsum_10
   <int> <int> <dbl>    <int>     <dbl>
 1     6     1     4        1         4
 2    10     2     5        1         9
 3     1     3     7        2         7
 4     5     4     3        2        10
 5     3     5     8        3         8
 6     9     6     1        3         9
 7    14     7     2        4         2
 8    11     8     5        4         7
 9    15     9     3        4        10
10     8    10     6        5         6
11    12    11     2        5         8
12     2    12     6        6         6
13     4    13     3        6         9
14     7    14     1        6        10
15    13    15     4        7         4
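
As a quick sanity check on the output above, the grouping can be verified in base R without any packages. The sketch below copies the value and group_10 columns from the table, confirms that no group total exceeds 10, and recomputes the within-group running sum. The variable names are only for illustration.

```r
# Values and group ids copied from the output table above.
value    <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)

# Total per group; every total should be at most 10.
group_totals <- tapply(value, group_10, sum)
stopifnot(all(group_totals <= 10))

# Running sum within each group; should match the cumsum_10 column.
cumsum_10 <- ave(value, group_10, FUN = cumsum)
```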