问题
Hopefully the title is explicit enough.
I have a table looking like that :
classes id value
a 1 10
a 2 15
a 3 12
b 1 5
b 2 9
b 3 7
c 1 6
c 2 14
c 3 6
and here is what I would like :
classes id value cumsum
a 1 10 10
a 2 15 25
a 3 12 37
b 1 5 5
b 2 9 14
b 3 7 21
c 1 6 6
c 2 14 20
c 3 6 26
I've seen this solution, and I've already applied it successfully to cases where I don't have multiple classes :
id value cumsum
1 10 10
2 15 25
3 12 37
It was reasonably fast, even with datasets of size equivalent to the one I'm currently working on.
However, when I try to apply the exact same code to the dataset I'm working on now (which looks like the first table of this question, IE multiple classes), without subsetting it by a
,b
,c
, it seems to me that it's taking ages (it's been running for 4 hours now. The dataset is 40.000 rows).
Any idea if there is an issue with the code from the linked answer, when used in this context ? I have trouble wrapping my head around the triangular join thingy, but I have the feeling there might be an issue with the size the join takes when the number of rows increases, thus slowing the whole thing a lot, which maybe is even worsened by the fact that there are multiple "classes" on which to do the cumulative sums.
Is there any way this could be done faster ? I'm using SQL
in R
through the SQLDF
package. A solution in either R code (with or without an external common package) or SQL code will do.
Thanks
回答1:
In SQL, you can do a cumulative sum using the ANSI standard sum() over ()
functionality:
select classes, id, value,
sum(value) over (partition by classes order by id) as cumesum
from t;
回答2:
Or you can use by
from the base
package:
df$cumsum <- unlist(by(df$value, df$classes, cumsum))
# classes id value cumsum
#1 a 1 10 10
#2 a 2 15 25
#3 a 3 12 37
#4 b 1 5 5
#5 b 2 9 14
#6 b 3 7 21
#7 c 1 6 6
#8 c 2 14 20
#9 c 3 6 26
来源:https://stackoverflow.com/questions/35339681/multiple-cumulative-sums