Count distinct by group - moving window

没有蜡笔的小新 2021-01-15 14:02

Let's say I have a dataset containing visits in a hospital. My goal is to generate a variable that counts, for each visit, the number of unique patients the visitor has seen up to that visit date.

3 answers
  • 2021-01-15 14:14

    You can do:

    with(df, ave(patient, visitor, FUN = function(x) cumsum(!duplicated(x))))
    
     [1] 1 1 1 2 2 2 2 2 3 3
    

    Essentially, it is a cumulative sum of non-duplicated values per group.
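
    To see why this works, the intermediate steps can be run on the sample patient column on its own. This is a minimal sketch that assumes the rows are already in visit order; the vector `patient` is copied from the sample data in this thread, and `first_seen` and `res` are names chosen here for illustration.

    ```r
    # Patient IDs for one visitor, in visit order (values from the sample data)
    patient <- c(15200L, 15200L, 15200L, 52607L, 52607L, 52607L,
                 15200L, 15200L, 20589L, 20589L)

    # TRUE on the first occurrence of each patient, FALSE on repeats
    first_seen <- !duplicated(patient)

    # Running total of first occurrences = distinct patients seen so far
    res <- cumsum(first_seen)
    print(res)
    ```

    Patient 15200 reappears on rows 7 and 8, but `duplicated()` flags those rows as repeats, so the count stays at 2 there.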

    You can do the same with dplyr (after loading it with `library(dplyr)`):

    df %>%
     group_by(visitor) %>%
     mutate(res = cumsum(!duplicated(patient)))
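
    Both versions count in row order, so they only give "patients seen so far" if each visitor's rows are sorted by date. Below is a hedged base-R sketch of sorting first, assuming `visitdt` is stored as month/day/year text as in the sample data. The shuffled four-row `df` is made up for illustration.

    ```r
    # Hypothetical, deliberately unsorted visits for one visitor
    df <- data.frame(
      visitor = c(125469L, 125469L, 125469L, 125469L),
      visitdt = c("3/30/2018", "1/12/2018", "2/23/2018", "1/19/2018"),
      patient = c(20589L, 15200L, 52607L, 15200L)
    )

    # Parse the text dates; sorting the raw strings would not be chronological
    df$visitdt <- as.Date(df$visitdt, format = "%m/%d/%Y")

    # Order by visitor, then by visit date
    df <- df[order(df$visitor, df$visitdt), ]

    # Cumulative count of first-time patients within each visitor
    df$res <- with(df, ave(patient, visitor,
                           FUN = function(x) cumsum(!duplicated(x))))
    print(df)
    ```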
    
  • 2021-01-15 14:18

    We can use dplyr:

    library(dplyr)
    df1 %>%
      group_by(visitor) %>%
      mutate(goal = cummax(match(patient, unique(patient))))
      # or, equivalently, with a factor:
      # mutate(goal1 = cummax(as.integer(factor(patient, levels = unique(patient)))))
    
    # A tibble: 10 x 4
    # Groups:   visitor [1]
    #   visitor visitdt   patient  goal
    #     <int> <chr>       <int> <int>
    # 1  125469 1/12/2018   15200     1
    # 2  125469 1/19/2018   15200     1
    # 3  125469 2/16/2018   15200     1
    # 4  125469 2/23/2018   52607     2
    # 5  125469 3/9/2018    52607     2
    # 6  125469 3/16/2018   52607     2
    # 7  125469 3/23/2018   15200     2
    # 8  125469 3/29/2018   15200     2
    # 9  125469 3/30/2018   20589     3
    #10  125469 4/6/2018    20589     3
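
    The two steps inside `mutate()` can be looked at separately: `match()` labels each patient by its order of first appearance, and `cummax()` keeps the highest label reached so far. A small base-R sketch using the sample patient IDs follows; `idx` and `goal` are names chosen here for illustration.

    ```r
    # Patient IDs for one visitor, in visit order (values from the sample data)
    patient <- c(15200L, 15200L, 15200L, 52607L, 52607L, 52607L,
                 15200L, 15200L, 20589L, 20589L)

    # Position of each patient in the list of distinct patients,
    # i.e. 1 for the first patient seen, 2 for the second, and so on
    idx <- match(patient, unique(patient))

    # Running maximum of those positions = distinct patients seen so far
    goal <- cummax(idx)
    print(rbind(idx, goal))
    ```

    On rows 7 and 8 `idx` drops back to 1 because patient 15200 returns, while `goal` stays at 2.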
    

    data

    df1 <- structure(list(visitor = c(125469L, 125469L, 125469L, 125469L, 
    125469L, 125469L, 125469L, 125469L, 125469L, 125469L), visitdt = c("1/12/2018", 
    "1/19/2018", "2/16/2018", "2/23/2018", "3/9/2018", "3/16/2018", 
    "3/23/2018", "3/29/2018", "3/30/2018", "4/6/2018"), patient = c(15200L, 
    15200L, 15200L, 52607L, 52607L, 52607L, 15200L, 15200L, 20589L, 
    20589L), goal = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L)),
    class = "data.frame", row.names = c(NA, 
    -10L))
    
  • 2021-01-15 14:27

    What you are tracking sounds important. Another option uses data.table: a non-equi join counts the distinct patients on or before each visit date, and the result is then assigned by reference:

    DT[, goal2 :=
        DT[.SD, on=.(visitor, visitdt<=visitdt), allow.cartesian=TRUE, 
            length(unique(patient)), by=.EACHI]$V1]
    

    output:

        visitor    visitdt patient goal goal2
     1:  125469 2018-01-12   15200    1     1
     2:  125469 2018-01-19   15200    1     1
     3:  125469 2018-02-16   15200    1     1
     4:  125469 2018-02-23   52607    2     2
     5:  125469 2018-03-09   52607    2     2
     6:  125469 2018-03-16   52607    2     2
     7:  125469 2018-03-23   15200    2     2
     8:  125469 2018-03-29   15200    2     2
     9:  125469 2018-03-30   20589    3     3
    10:  125469 2018-04-06   20589    3     3
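
    The join condition `visitdt<=visitdt` means each row is matched to all of the same visitor's rows dated on or before it. The same rule can be sketched in plain base R, which also shows where it differs from a row-order cumulative count: when a visitor has two visits on the same day. The data frame `visits` and the helper `count_seen` are made up for illustration, and the row-by-row loop is quadratic, so this is only meant for small data.

    ```r
    # Hypothetical visits: visitor 1 sees two different patients on the same day
    visits <- data.frame(
      visitor = c(1L, 1L, 1L, 2L),
      visitdt = as.Date(c("2018-01-12", "2018-01-12", "2018-02-23", "2018-01-12")),
      patient = c(15200L, 52607L, 15200L, 15200L)
    )

    # Distinct patients on rows with the same visitor and a date <= row i's date
    count_seen <- function(i, d) {
      same_visitor <- d$visitor == d$visitor[i]
      on_or_before <- d$visitdt <= d$visitdt[i]
      length(unique(d$patient[same_visitor & on_or_before]))
    }
    visits$goal2 <- vapply(seq_len(nrow(visits)), count_seen, integer(1), d = visits)

    # Row-order cumulative count, for comparison
    visits$cum <- with(visits, ave(patient, visitor,
                                   FUN = function(x) cumsum(!duplicated(x))))
    print(visits)
    ```

    On the first row the date-based count is 2, because both same-day patients are included, while the row-order count is 1. Which of the two is wanted depends on how same-day visits should be treated.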
    

    data:

    library(data.table)
    DT <- fread("visitor visitdt patient goal
    125469  1/12/2018   15200   1
    125469  1/19/2018   15200   1
    125469  2/16/2018   15200   1
    125469  2/23/2018   52607   2
    125469  3/9/2018    52607   2
    125469  3/16/2018   52607   2
    125469  3/23/2018   15200   2
    125469  3/29/2018   15200   2
    125469  3/30/2018   20589   3
    125469  4/6/2018    20589   3")
    DT[, visitdt := as.Date(visitdt, "%m/%d/%Y")]
    