问题
For the dataset test
, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test
data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
回答1:
This isn't so elegant but it seems to work. Assuming df
is the data frame:
# make a list to hold unique IDS by
uniques = list()
for(i in 1:max(df$period)){
uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.
回答2:
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply
.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
回答3:
Here is an attempt using dplyr
, though it also uses some standard syntax in the summarise
:
test %>%
group_by(period) %>%
summarise(retention=length(intersect(user_id,test$user_id[test$period==(period+1)]))/n_distinct(user_id)) %>%
mutate(retention=lag(retention))
This returns:
period retention
<dbl> <dbl>
1 1 NA
2 2 0.5000000
3 3 0.6666667
4 4 0.5000000
来源:https://stackoverflow.com/questions/44078742/calculating-simple-retention-in-r