Let's say I have a dataset containing visits in a hospital. My goal is to generate a variable that counts the number of unique patients the visitor has seen up to and including the date of each visit.
You can do:
with(df, ave(patient, visitor, FUN = function(x) cumsum(!duplicated(x))))
[1] 1 1 1 2 2 2 2 2 3 3
Essentially, it is a cumulative sum of non-duplicated values per group.
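To see why this works, here is a minimal sketch that breaks the expression into its two steps for a single visitor. It uses the patient IDs from the example data and assumes the rows are already sorted by visit date; the names `first_seen` and `res` are only illustrative.

```r
# Patient IDs for one visitor, in visit order (from the example data)
patient <- c(15200L, 15200L, 15200L, 52607L, 52607L,
             52607L, 15200L, 15200L, 20589L, 20589L)

# TRUE the first time each patient appears, FALSE on repeat visits
first_seen <- !duplicated(patient)

# Running total of first appearances = distinct patients seen so far
res <- cumsum(first_seen)
res
```

`ave()` then applies the same function separately within each `visitor` group.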
You can also do the same with dplyr:
library(dplyr)

df %>%
  group_by(visitor) %>%
  mutate(res = cumsum(!duplicated(patient)))
We can use dplyr:
library(dplyr)
df1 %>%
  group_by(visitor) %>%
  mutate(goal = cummax(match(patient, unique(patient))))
# or with factor
# mutate(goal1 = cummax(as.integer(factor(patient, levels = unique(patient)))))
# A tibble: 10 x 4
# Groups:   visitor [1]
#    visitor visitdt   patient  goal
#      <int> <chr>       <int> <int>
#  1  125469 1/12/2018   15200     1
#  2  125469 1/19/2018   15200     1
#  3  125469 2/16/2018   15200     1
#  4  125469 2/23/2018   52607     2
#  5  125469 3/9/2018    52607     2
#  6  125469 3/16/2018   52607     2
#  7  125469 3/23/2018   15200     2
#  8  125469 3/29/2018   15200     2
#  9  125469 3/30/2018   20589     3
# 10  125469 4/6/2018    20589     3
data:
df1 <- structure(list(visitor = c(125469L, 125469L, 125469L, 125469L,
    125469L, 125469L, 125469L, 125469L, 125469L, 125469L), visitdt = c("1/12/2018",
    "1/19/2018", "2/16/2018", "2/23/2018", "3/9/2018", "3/16/2018",
    "3/23/2018", "3/29/2018", "3/30/2018", "4/6/2018"), patient = c(15200L,
    15200L, 15200L, 52607L, 52607L, 52607L, 15200L, 15200L, 20589L,
    20589L), goal = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L)),
    class = "data.frame", row.names = c(NA,
    -10L))
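As a rough illustration of why `cummax(match(patient, unique(patient)))` gives the running count, the sketch below evaluates each piece on the patient column for the single visitor in the example. The intermediate names `first_order` and `idx` are only for illustration, and the rows are assumed to be in visit-date order.

```r
patient <- c(15200L, 15200L, 15200L, 52607L, 52607L,
             52607L, 15200L, 15200L, 20589L, 20589L)

# unique() keeps patients in order of first appearance: 15200, 52607, 20589
first_order <- unique(patient)

# match() gives each row the position of its patient in that ordering
idx <- match(patient, first_order)

# cummax() carries the highest position reached so far forward, so a return
# visit to an earlier patient (rows 7 and 8) does not lower the count
goal <- cummax(idx)
goal
```

Because a patient's position in `first_order` equals the number of distinct patients seen when that patient first appears, the running maximum of those positions is the count of distinct patients so far.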
That sounds like an important thing to track. Another option is data.table, using a non-equi join and then updating by reference:
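# For each row, join back to every row of the same visitor dated on or
# before it, then count the distinct patients among the matches
DT[, goal2 :=
    DT[.SD, on = .(visitor, visitdt <= visitdt), allow.cartesian = TRUE,
        length(unique(patient)), by = .EACHI]$V1]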
output:
    visitor    visitdt patient goal goal2
 1:  125469 2018-01-12   15200    1     1
 2:  125469 2018-01-19   15200    1     1
 3:  125469 2018-02-16   15200    1     1
 4:  125469 2018-02-23   52607    2     2
 5:  125469 2018-03-09   52607    2     2
 6:  125469 2018-03-16   52607    2     2
 7:  125469 2018-03-23   15200    2     2
 8:  125469 2018-03-29   15200    2     2
 9:  125469 2018-03-30   20589    3     3
10:  125469 2018-04-06   20589    3     3
data:
library(data.table)
DT <- fread("visitor visitdt patient goal
125469 1/12/2018 15200 1
125469 1/19/2018 15200 1
125469 2/16/2018 15200 1
125469 2/23/2018 52607 2
125469 3/9/2018 52607 2
125469 3/16/2018 52607 2
125469 3/23/2018 15200 2
125469 3/29/2018 15200 2
125469 3/30/2018 20589 3
125469 4/6/2018 20589 3")
DT[, visitdt := as.Date(visitdt, "%m/%d/%Y")]
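For readers less familiar with non-equi joins, the following base R sketch spells out what the join condition `visitor == visitor, visitdt <= visitdt` computes for each row. It is an illustrative equivalent, not the data.table implementation, and is much slower on large data; the names `visits` and `count_seen` are hypothetical.

```r
# Same ten visits as the example data, with visitdt parsed as Date
visits <- data.frame(
  visitor = rep(125469L, 10),
  visitdt = as.Date(c("1/12/2018", "1/19/2018", "2/16/2018", "2/23/2018",
                      "3/9/2018", "3/16/2018", "3/23/2018", "3/29/2018",
                      "3/30/2018", "4/6/2018"), format = "%m/%d/%Y"),
  patient = c(15200L, 15200L, 15200L, 52607L, 52607L,
              52607L, 15200L, 15200L, 20589L, 20589L)
)

# For row i: distinct patients among rows with the same visitor
# and a visit date on or before row i's date
count_seen <- function(i) {
  same_visitor <- visits$visitor == visits$visitor[i]
  on_or_before <- visits$visitdt <= visits$visitdt[i]
  length(unique(visits$patient[same_visitor & on_or_before]))
}

visits$goal2 <- vapply(seq_len(nrow(visits)), count_seen, integer(1))
visits$goal2
```

Unlike the `cumsum(!duplicated(...))` approaches, this comparison depends on the dates rather than on row order, so the data does not need to be sorted first. If a visitor sees two different patients on the same date, both rows receive the count that includes both patients.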