Question
This is my transaction data (stored in data):
id from to date amount
<int> <fctr> <fctr> <date> <dbl>
19521 6644 6934 2005-01-01 700.0
19524 6753 8456 2005-01-01 600.0
19523 9242 9333 2005-01-01 1000.0
… … … … …
1055597 9866 9736 2010-12-31 278.9
1053519 9868 8644 2010-12-31 242.8
1052790 9869 8399 2010-12-31 372.2
Now, for each distinct account in the from column, I want to keep track of the total transaction amount it sent over the last 6 months as of each transaction, i.e. relative to the date on which that particular transaction was made.
To illustrate, I will consider only account 5370 here. Its transactions are:
id from to date amount
<int> <fctr> <fctr> <date> <dbl>
18529 5370 9356 2005-05-31 24.4
13742 5370 5605 2005-08-05 7618.0
9913 5370 8567 2005-09-12 21971.0
2557 5370 5636 2005-11-12 2921.0
18669 5370 8933 2005-11-30 169.2
35900 5370 8483 2006-01-31 71.5
51341 5370 7626 2006-04-11 4214.0
83324 5370 9676 2006-08-31 261.1
100277 5370 9105 2006-10-31 182.0
103444 5370 9772 2006-11-08 16927.0
The very first transaction 5370 made was on 2005-05-31, so there is no record before that. That is why this is the starting date for 5370 (each distinct account has its own starting date, namely the date of its first transaction). Thus, the total transaction amount sent by 5370 in the last 6 months at that time was just 24.4. The second transaction by 5370 was made on 2005-08-05. At that time, the total transaction amount sent by 5370 in the last 6 months was 24.4 + 7618.0 = 7642.4. So, the output should be as follows:
id from to date amount total_trx_amount_sent_in_last_6month_by_from
<int> <fctr> <fctr> <date> <dbl> <dbl>
18529 5370 9356 2005-05-31 24.4 24.4
13742 5370 5605 2005-08-05 7618.0 (24.4+7618.0)=7642.4
9913 5370 8567 2005-09-12 21971.0 (24.4+7618.0+21971.0)=29613.4
2557 5370 5636 2005-11-12 2921.0 (24.4+7618.0+21971.0+2921.0)=32534.4
18669 5370 8933 2005-11-30 169.2 (7618.0+21971.0+2921.0+169.2)=32679.2
35900 5370 8483 2006-01-31 71.5 (7618.0+21971.0+2921.0+169.2+71.5)=32750.7
51341 5370 7626 2006-04-11 4214.0 (2921.0+169.2+71.5+4214.0)=7375.7
83324 5370 9676 2006-08-31 261.1 (4214.0+261.1)=4475.1
100277 5370 9105 2006-10-31 182.0 (261.1+182.0)=443.1
103444 5370 9772 2006-11-08 16927.0 (261.1+182.0+16927.0)=17370.1
For the calculations, I subtracted 180 days (approx. 6 months) from the transaction date on each line. That is how I chose which amounts should be summed up: those dated between that cutoff and the transaction date, inclusive.
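The 180-day rule described above can be checked in base R for account 5370 alone. This is only a minimal sketch of the arithmetic, not a solution for the full data; the names dates, amounts and window_total are made up for illustration, and the values are copied from the table above.

```r
# Transactions sent by account 5370 (values copied from the question's table)
dates <- as.Date(c("2005-05-31", "2005-08-05", "2005-09-12", "2005-11-12",
                   "2005-11-30", "2006-01-31", "2006-04-11", "2006-08-31",
                   "2006-10-31", "2006-11-08"))
amounts <- c(24.4, 7618.0, 21971.0, 2921.0, 169.2,
             71.5, 4214.0, 261.1, 182.0, 16927.0)

# For each transaction date d, sum the amounts dated within [d - 180, d]
window_total <- vapply(seq_along(dates), function(i) {
  d <- dates[i]
  sum(amounts[dates >= d - 180 & dates <= d])
}, numeric(1))

print(data.frame(date = dates, amount = amounts, total = window_total))
```

The total column should reproduce the hand-computed sums in the expected output, for example 32679.2 for 2005-11-30, where the 24.4 from 2005-05-31 has dropped out of the window.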
So, how can I achieve this for the whole data, considering all the distinct accounts?
PS: My data has 1 million rows, so the solution should also run fast on a large dataset.
Answer 1:
One way, using dplyr, could be:
library(dplyr)

df %>%
  group_by(from) %>%
  # for each transaction date .x, sum the amounts dated within [.x - 180, .x]
  mutate(total_trx = purrr::map_dbl(date,
                       ~ sum(amount[between(date, .x - 180, .x)])))
# id from to date amount total_trx
# <int> <int> <int> <date> <dbl> <dbl>
# 1 18529 5370 9356 2005-05-31 24.4 24.4
# 2 13742 5370 5605 2005-08-05 7618 7642.
# 3 9913 5370 8567 2005-09-12 21971 29613.
# 4 2557 5370 5636 2005-11-12 2921 32534.
# 5 18669 5370 8933 2005-11-30 169. 32679.
# 6 35900 5370 8483 2006-01-31 71.5 32751.
# 7 51341 5370 7626 2006-04-11 4214 7376.
# 8 83324 5370 9676 2006-08-31 261. 4475.
# 9 100277 5370 9105 2006-10-31 182 443.
#10 103444 5370 9772 2006-11-08 16927 17370.
If your data is huge, you can apply the same approach in data.table, which might be more efficient.
library(data.table)

# same 180-day window sum, computed per sender account (by = from)
setDT(df)[, total_trx := sapply(date, function(x)
  sum(amount[between(date, x - 180, x)])), by = from]
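Both snippets above rescan the whole group once per row, so the work grows quadratically with the number of transactions per account. As an alternative that is not part of the original answer, a data.table non-equi self-join can compute the same window sums and may scale better on large data. The following is a minimal sketch, assuming data.table is installed and using only the account 5370 sample from the question; the column names start and total_trx and the helper table res are illustrative.

```r
library(data.table)

# Sample data: account 5370 from the question
df <- data.table(
  id     = c(18529L, 13742L, 9913L, 2557L, 18669L,
             35900L, 51341L, 83324L, 100277L, 103444L),
  from   = 5370L,
  date   = as.Date(c("2005-05-31", "2005-08-05", "2005-09-12", "2005-11-12",
                     "2005-11-30", "2006-01-31", "2006-04-11", "2006-08-31",
                     "2006-10-31", "2006-11-08")),
  amount = c(24.4, 7618.0, 21971.0, 2921.0, 169.2,
             71.5, 4214.0, 261.1, 182.0, 16927.0)
)

# Start of the 180-day window for each transaction
df[, start := date - 180]

# Non-equi self-join: for each row of the inner df (i), match rows of the
# outer df (x) with the same sender and start_i <= date_x <= date_i, then
# sum the matched amounts; by = .EACHI yields one result per row of i
res <- df[df, on = .(from, date >= start, date <= date),
          .(total = sum(amount)), by = .EACHI]

# Results come back in the row order of i, so they can be assigned directly
df[, total_trx := res$total]
print(df[, .(id, from, date, amount, total_trx)])
```

Every row matches at least itself, so no window is empty. On the real data it would be worth comparing timings and results against the sapply version before relying on this.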
Source: https://stackoverflow.com/questions/63689198/how-can-i-keep-track-of-total-transaction-amount-sent-from-an-account-each-last