问题
here's some dummy data:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
I'd like to calculate for each user_id
the number of distinct categories
in the specified time period (e.g. in the past 7, 14 days), including the current order
The solution would look like this:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
I posted similar questions here or here, however none of it referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!
回答1:
In the tidyverse, you can use map_int
to iterate over a set of values and simplify to an integer à la sapply
or vapply
. Count distinct occurrences with n_distinct
(like length(unique(...))
) of an object subset by comparisons or the helper between
, with a minimum set by the appropriate amount subtracted from that day, and you're set.
library(tidyverse)
df %>% group_by(user_id) %>%
mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2
回答2:
Here are two data.table
solutions, one with two nested lapply
and the other using non-equi joins.
The first one is a rather clumsy data.table
solution but it reproduces the expected answer. And it would work for an arbitrary number of time frames. (Although @alistaire's concise tidyverse
solution he had suggested in his comment could be modified as well).
It uses two nested lapply
. The first one loops over the time frames, the second one over the dates. The tempory result is joined with the original data and then reshaped from long to wide format so that we will end with a separate column for each of the time frames.
library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
Here is a variant following a suggestion by @Frank which uses data.table
's non-equi joins instead of the second lapply
:
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
Data:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.
回答3:
U recommend using runner package. You can use any R function on running windows with runner
function. Code below obtains desided output, which is past 7-days + current and past 14-days + current (current 8 and 15 days):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))
More informations in package and function documentation.
来源:https://stackoverflow.com/questions/41693081/r-calculate-number-of-distinct-categories-in-the-specified-time-frame