Question
I'm trying to create a variable that flags each search as true (followed by a visit) or false (not followed by a visit). The original dataset is located here: https://github.com/wikimedia-research/Discovery-Hiring-Analyst-2016/blob/master/events_log.csv.gz
The basic scenario: each user (identified by an ID, either session_id or uuid in the original dataset) performs some number of true and false searches. A visit is always preceded by a search, but a search does not have to be followed by a visit. The original dataset also has a time variable, timestamp, which I do not know how to replicate here but believe will be useful.
A simplified sketch of the original structure:
ID Action Time
a search 1
a visit 2
a search 3
a visit 4
b visit 2
b visit 3
b search 1
c search 5
c search 6
c search 7
c visit 8
d search 3
d search 4
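For anyone who wants to experiment, the sample above could be entered as a data frame roughly like this. This is only a sketch: the name df1 is an assumption (it is the name the answer uses), and Time is taken to be a plain integer rather than the real timestamp column.

```r
# Sample data copied from the sketch above; df1 is an assumed name.
df1 <- data.frame(
  ID     = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d"),
  Action = c("search", "visit", "search", "visit", "visit", "visit", "search",
             "search", "search", "search", "visit", "search", "search"),
  Time   = c(1L, 2L, 3L, 4L, 2L, 3L, 1L, 5L, 6L, 7L, 8L, 3L, 4L),
  stringsAsFactors = FALSE  # keep ID and Action as character columns
)
```

Note that the rows for ID b are deliberately out of time order, as in the sketch, so any solution has to sort by Time first.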
The result should keep only the rows where Action = search, with a new column marking whether each search led to a visit, in the following format:
Structure I'm trying to produce:
ID Action ClickThrough
a search T
a search T
b search T
c search F
c search F
c search T
d search F
d search F
Answer 1:
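This produces the expected output using dplyr. After sorting by ID and Time, diff(Action == "visit") is nonzero wherever consecutive rows switch between search and visit, so a search directly followed by a visit gets TRUE. The trailing FALSE pads the last row of each group, which has no following row.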
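library(dplyr)

df1 %>%
  arrange(ID, Time) %>%          # order events chronologically within each ID
  group_by(ID) %>%
  # nonzero diff = the next row switches between search and visit;
  # pad with FALSE because the last row of a group has no successor
  mutate(ClickThrough = c(as.logical(diff(Action == "visit")), FALSE)) %>%
  filter(Action == "search")     # keep only the search rows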
# # A tibble: 8 x 4
# # Groups:   ID [4]
#   ID    Action  Time ClickThrough
#   <chr> <chr>  <int> <lgl>
# 1 a     search     1 TRUE
# 2 a     search     3 TRUE
# 3 b     search     1 TRUE
# 4 c     search     5 FALSE
# 5 c     search     6 FALSE
# 6 c     search     7 TRUE
# 7 d     search     3 FALSE
# 8 d     search     4 FALSE
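A possibly more readable variant of the same idea, offered as a sketch rather than as the answer's method, is to look at the next event directly with dplyr's lead(). The data frame below is rebuilt from the question's sample so the snippet runs on its own; the names df1 and result are assumptions.

```r
library(dplyr)

# Sample data from the question; df1 is an assumed name.
df1 <- data.frame(
  ID     = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d"),
  Action = c("search", "visit", "search", "visit", "visit", "visit", "search",
             "search", "search", "search", "visit", "search", "search"),
  Time   = c(1L, 2L, 3L, 4L, 2L, 3L, 1L, 5L, 6L, 7L, 8L, 3L, 4L),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  arrange(ID, Time) %>%
  group_by(ID) %>%
  # A search counts as a click-through when the next event for the same ID
  # is a visit; default = "" makes the last event of each ID compare FALSE.
  mutate(ClickThrough = lead(Action, default = "") == "visit") %>%
  ungroup() %>%
  filter(Action == "search")
```

On the search rows this gives the same ClickThrough values as the diff() version. The difference shows up only on visit rows, which are filtered out anyway.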
Source: https://stackoverflow.com/questions/48800381/dplyr-table-reconstructing-data-wrangling