R Cumulative Sum with a condition and a reset

后端 未结 3 558
日久生厌
日久生厌 2020-12-17 04:48

I have a signal position indicator vector consisting out of -1s and 1s. In addition, I have volume data which I want to sum based on the value of Signal. The basic data tabl

相关标签:
3条回答
  • 2020-12-17 05:29

    As suggested by @docendo this should work:

    df[,cum := cumsum(Volume)*Signal,.(rleid(Signal))]
    
              date Signal    Volume        cum
     1: 2016-01-04     NA  37912403         NA
     2: 2016-01-05     -1  23258238  -23258238
     3: 2016-01-06     -1  25096183  -48354421
     4: 2016-01-07     -1  45172906  -93527327
     5: 2016-01-08     -1  35402298 -128929625
     6: 2016-01-11     -1  29932385 -158862010
     7: 2016-01-12     -1  28395390 -187257400
     8: 2016-01-13     -1  33410553 -220667953
     9: 2016-01-14     -1  48658623 -269326576
    10: 2016-01-15      1  46132781   46132781
    11: 2016-01-19      1  30998256   77131037
    
    0 讨论(0)
  • 2020-12-17 05:46

    This can be achieved by:

    library(tidyverse)
    library(data.table)     
    
    z %>%
      group_by(rleid(Signal)) %>% #advance value every time Signal changes and group by that
      mutate(cum = Signal*cumsum(Volume)) %>% #cumsum in each group
      ungroup() %>% #ungroup so you could remove the grouping column
      select(-4) #remove grouping column
    

    or without data.table by using rle:

    z %>%
      mutate(rl = rep(1:length(rle(Signal)$length), times = rle(Signal)$length)) %>%
      group_by(rl) %>%
      mutate(cum = Signal*cumsum(Volume)) %>%
      ungroup() %>%
      select(-4)
    
    #output
        date       Signal    Volume        cum
    
      <fct>       <int>     <int>      <int>
     1 2016-01-04     NA  37912403         NA
     2 2016-01-05    - 1  23258238 - 23258238
     3 2016-01-06    - 1  25096183 - 48354421
     4 2016-01-07    - 1  45172906 - 93527327
     5 2016-01-08    - 1  35402298 -128929625
     6 2016-01-11    - 1  29932385 -158862010
     7 2016-01-12    - 1  28395390 -187257400
     8 2016-01-13    - 1  33410553 -220667953
     9 2016-01-14    - 1  48658623 -269326576
    10 2016-01-15      1  46132781   46132781
    11 2016-01-19      1  30998256   77131037
    12 2016-01-20    - 1  59051429 - 59051429
    13 2016-01-21      1  30518939   30518939
    14 2016-01-22      1  30495387   61014326
    15 2016-01-25      1  32482015   93496341
    16 2016-01-26    - 1  26877080 - 26877080
    17 2016-01-27    - 1  58699359 - 85576439
    18 2016-01-28      1 107475327  107475327
    19 2016-01-29      1  62739548  170214875
    20 2016-02-01      1  46132726  216347601
    

    data:

    z <- read.table(text =      "date     Signal    Volume
               2016-01-04     NA  37912403
               2016-01-05     -1  23258238
               2016-01-06     -1  25096183
               2016-01-07     -1  45172906
               2016-01-08     -1  35402298
               2016-01-11     -1  29932385
               2016-01-12     -1  28395390
               2016-01-13     -1  33410553
               2016-01-14     -1  48658623
               2016-01-15      1  46132781
               2016-01-19      1  30998256
               2016-01-20     -1  59051429
               2016-01-21      1  30518939
               2016-01-22      1  30495387
               2016-01-25      1  32482015
               2016-01-26     -1  26877080
               2016-01-27     -1  58699359
               2016-01-28      1 107475327
               2016-01-29      1  62739548
               2016-02-01      1  46132726", header = T)
    
    0 讨论(0)
  • 2020-12-17 05:52

    A pure dplyr way would be:

    df %>% 
      na.omit() %>% # omit NA to not multiply by NA
      mutate(isStep = (Signal - lag(Signal, 1)) != 0) %>% # Create a dummy variable for steps 
      mutate(isStep = ifelse(is.na(isStep), FALSE, isStep)) %>% 
      mutate(grp = cumsum(isStep)) %>% # create new ID based on steps
      group_by(grp) %>%  # group by before created steps
      mutate(res = cumsum(Signal * Volume)) %>% # calculate value
      select(x, Signal, Volume, res)
    
    # # A tibble: 19 x 5
    # # Groups:   grp [6]
    #      grp          x Signal    Volume        res
    #    <int>     <fctr>  <int>     <int>      <int>
    #  1     0 2016-01-05     -1  23258238  -23258238
    #  2     0 2016-01-06     -1  25096183  -48354421
    #  3     0 2016-01-07     -1  45172906  -93527327
    #  4     0 2016-01-08     -1  35402298 -128929625
    #  5     0 2016-01-11     -1  29932385 -158862010
    #  6     0 2016-01-12     -1  28395390 -187257400
    #  7     0 2016-01-13     -1  33410553 -220667953
    #  8     0 2016-01-14     -1  48658623 -269326576
    #  9     1 2016-01-15      1  46132781   46132781
    # 10     1 2016-01-19      1  30998256   77131037
    # 11     2 2016-01-20     -1  59051429  -59051429
    # 12     3 2016-01-21      1  30518939   30518939
    # 13     3 2016-01-22      1  30495387   61014326
    # 14     3 2016-01-25      1  32482015   93496341
    # 15     4 2016-01-26     -1  26877080  -26877080
    # 16     4 2016-01-27     -1  58699359  -85576439
    # 17     5 2016-01-28      1 107475327  107475327
    # 18     5 2016-01-29      1  62739548  170214875
    # 19     5 2016-02-01      1  46132726  216347601
    
    0 讨论(0)
提交回复
热议问题