问题
I want to seamlessly identify new orders (acquisitions) and returns in my transactional database table.
This sounds like the perfect job for a window function; I would like to perform this operation in dbplyr
.
My current process is to:
- Create a query object I then use into
dbGetQuery()
; this query contains a standardrank()
window function as usually seen inpostgresql
- Ingest this query into my R environment
Then using an
ifelse()
function into themutate()
verb, I identify the first orders (aka acquisition orders) as the ones marked with 1 by the window function and "recurring" orders otherwise.query <- "SELECT o.user_id, o.id, o.completed_at, rank() over (partition by o.user_id order by o.completed_at asc) as order_number FROM orders as o" df <- dbGetQuery(db, query) %>% mutate(order_type = ifelse(order_number == '1','acquisition','repeat'))
I assume there is a way to squeeze this process using dbplyr
but at the moment I don't know exactly how.
This is the output of the query:
id user_id completed_at order_number
1 58051 68 2019-02-02 09:45:59 1
2 78173 7173 2019-03-28 08:30:16 1
3 79585 7173 2019-04-15 21:59:51 2
4 105261 7173 2019-07-15 13:51:44 3
5 57158 7181 2019-01-02 08:30:12 1
6 64316 7185 2019-02-24 14:54:26 1
7 77556 7185 2019-03-26 08:30:26 2
8 91287 7185 2019-04-25 08:30:25 3
9 55781 7191 2018-12-04 09:21:42 1
10 57039 7191 2019-01-01 08:30:11 2
11 55947 7204 2018-12-10 20:56:41 1
12 106126 7204 2019-06-28 15:10:27 2
13 112490 7204 2019-07-19 14:38:16 3
14 112514 7204 2019-07-19 16:24:09 4
You can find test data in this gdoc -> link.
回答1:
One of the key challenges to do this that dbplyr can only translate a limited set of R commands into SQL commands. R functions outside of dplyr are unlikely to translate to SQL effectively.
Assuming db
is you database connection, I would try something like the following:
# create a local pointer to the database table
orders <- tbl(db, from = "orders")
# define table using dplyr commands
df <- orders %>%
group_by(user_id) %>%
mutate(order_number = row_number(complete_at)) %>%
select(user_id, id, completed_at) %>%
mutate(order_type = ifelse(order_number == 1, 'acquisition', 'repeat'))
# check underlying sql and confirm it is correct
df %>% show_query()
# load data into local R from database
df <- collect(df)
Depending on your application, you may prefer dense_rank
to row_number
. I have not tested the dbplyr translation of either to sql. Using only operations I know work in my R & database environment I would write:
orders <- tbl(db, from = "orders")
df_acquisition <- orders %>%
group_by(user_id) %>%
mutate(tmp = lead(complete_at, 1, order_by = "complete_at")) %>%
# only the latest record will lack a tmp value
filter(is.na(tmp)) %>%
select(user_id, id, completed_at) %>%
mutate(order_type = 'acquisition')
And then create the repeat orders separately.
来源:https://stackoverflow.com/questions/57502642/apply-a-ranking-window-function-in-dbplyr-backend