Summarize consecutive failures with dplyr and rle

[愿得一人] 2021-01-20 03:56

I'm trying to build a churn model that includes the maximum number of consecutive UX failures for each customer, and I'm having trouble. Here's my simplified data and desired o

2 Answers
  • 2021-01-20 04:15

    We group by 'customerId' and use do() to run rle() on the 'isFailure' column within each group. From the result we extract the lengths of the runs whose value is TRUE (lengths[values]) and create the 'Max' column, with an if/else condition that returns 0 for customers that have no failure (no 1 value) at all.

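     library(dplyr)

     df %>%
        group_by(customerId) %>%
        # tmp: lengths of the runs where isFailure == 1 (empty if the customer never failed)
        do({tmp <- with(rle(.$isFailure==1), lengths[values])
         data.frame(customerId= .$customerId, Max=if(length(tmp)==0) 0 
                        else max(tmp)) }) %>% 
        # the data.frame above repeats Max on every row of the group; keep one row per customer
         slice(1L)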
    #   customerId Max
    #1          1   0
    #2          2   1
    #3          3   2
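
    As a minimal sketch of what rle() contributes here, the same idea can be written with summarise() instead of do(). The sample df below is hypothetical (the question's data is not shown above) and was chosen only to be consistent with the output listed; max_true_run is an illustrative helper name, not part of dplyr.

    ```r
    library(dplyr)

    # Hypothetical sample data, assumed for illustration only
    df <- data.frame(
      customerId = c(1, 2, 2, 3, 3, 3, 3),
      isFailure  = c(0, 0, 1, 1, 0, 1, 1)
    )

    # rle() splits a vector into runs: `values` holds each run's value,
    # `lengths` holds how many elements each run spans
    r <- rle(c(1, 0, 1, 1) == 1)
    r$values   # TRUE FALSE TRUE
    r$lengths  # 1 1 2

    # Illustrative helper: length of the longest run of TRUE, or 0 if there is none
    max_true_run <- function(x) {
      r <- rle(x)
      max(r$lengths[r$values], 0L)
    }

    res <- df %>%
      group_by(customerId) %>%
      summarise(Max = max_true_run(isFailure == 1))
    ```

    Passing 0L as an extra argument to max() covers customers without any failure, so no if/else branch or slice() step is needed in this variant.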
    
  • 2021-01-20 04:18

    Here is my try, only using standard dplyr functions:

    df %>% 
      # grouping key(s):
      group_by(customerId) %>%
      # check if there is any value change
      # if yes, a new sequence id is generated through cumsum
      mutate(last_one = lag(isFailure, 1, default = 100), 
             not_eq = last_one != isFailure, 
             seq = cumsum(not_eq)) %>% 
      # the following is just to find the largest sequence
      count(customerId, isFailure, seq) %>% 
      group_by(customerId, isFailure) %>% 
      summarise(max_consecutive_event = max(n))
    

    Output:

    # A tibble: 5 x 3
    # Groups:   customerId [3]
      customerId isFailure max_consecutive_event
           <dbl>     <dbl>                 <int>
    1          1         0                     1
    2          2         0                     1
    3          2         1                     1
    4          3         0                     1
    5          3         1                     2
    

    A final filter on the isFailure value would yield the wanted result, although customers with no failures would then need to be added back with a count of 0.

    The script accepts any values in the isFailure column and counts the maximum number of consecutive days with the same value.
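
    One possible way to do that final step, sketched under assumptions: the sample df is hypothetical (the question's data is not shown above), and the names max_runs, result and Max are chosen here for illustration. The pipeline is a condensed form of the one above, followed by a filter on isFailure == 1 and a left join that brings back customers without failures.

    ```r
    library(dplyr)

    # Hypothetical sample data, assumed for illustration only
    df <- data.frame(
      customerId = c(1, 2, 2, 3, 3, 3, 3),
      isFailure  = c(0, 0, 1, 1, 0, 1, 1)
    )

    # Longest run per customer and per isFailure value
    max_runs <- df %>%
      group_by(customerId) %>%
      mutate(seq = cumsum(isFailure != lag(isFailure, default = 100))) %>%
      ungroup() %>%
      count(customerId, isFailure, seq) %>%
      group_by(customerId, isFailure) %>%
      summarise(max_consecutive_event = max(n), .groups = "drop")

    # Keep only the failure runs, then add back customers that never failed
    result <- df %>%
      distinct(customerId) %>%
      left_join(filter(max_runs, isFailure == 1), by = "customerId") %>%
      transmute(customerId, Max = coalesce(max_consecutive_event, 0L))
    ```

    Customers without a matching failure row get NA from the join, which coalesce() turns into 0.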
