Question
I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows much faster than linearly as the dataset gets larger. I'm a novice R programmer; can you help me make this more efficient?
The function and how I invoke it:
accumulate <- function(recordnum, df) {
  # Flag rows for the same subject with a month at or before this record's month
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1)))
system.time(df$cumulative <- sapply(1:datalength, accumulate, df))
The input data frame:
> df
measurement subject month
1 0.4577418 dog 5
2 0.7191123 dog 4
3 0.9346722 dog 3
4 0.2554288 dog 2
5 0.4622928 dog 1
6 0.9400145 cat 5
7 0.9782264 cat 4
8 0.1174874 cat 3
9 0.4749971 cat 2
10 0.5603327 cat 1
The output data frame:
> df
measurement subject month cumulative
1 0.9148060 dog 5 3.6102141
2 0.9370754 dog 4 2.6954081
3 0.2861395 dog 3 1.7583327
4 0.8304476 dog 2 1.4721931
5 0.6417455 dog 1 0.6417455
6 0.5190959 cat 5 2.7524079
7 0.7365883 cat 4 2.2333120
8 0.1346666 cat 3 1.4967237
9 0.6569923 cat 2 1.3620571
10 0.7050648 cat 1 0.7050648
Notice that the cumulative column shows the accumulation of all measurements up to and including the current month. The function does not require the data frame to be sorted. With datalength = 100 the elapsed time is 0.3 s; at 1,000 it is 0.58 s; at 10,000 it is 27.72 s. I need this to run on 200K+ records.
Thanks!
Answer 1:
This is non-destructive, i.e. the original df is not modified. No packages are used. The original order of the rows of df is preserved; however, if that is not important, the [order(o), ] on the last line can be omitted.
# Sort by subject then month, accumulate within subject, then restore the order
o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
giving:
measurement subject month cumulative
1 0.37955924 dog 5 2.2580530
2 0.43577158 dog 4 1.8784938
3 0.03743103 dog 3 1.4427222
4 0.97353991 dog 2 1.4052912
5 0.43175125 dog 1 0.4317512
6 0.95757660 cat 5 4.0751151
7 0.88775491 cat 4 3.1175385
8 0.63997877 cat 3 2.2297836
9 0.97096661 cat 2 1.5898048
10 0.61883821 cat 1 0.6188382
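For reference, ave() is the base R helper doing the work here: it applies FUN within each group and returns the results aligned with the input positions. A minimal standalone illustration (my own toy example, not from the answer):
ave(c(1, 2, 3, 4), c("a", "b", "a", "b"), FUN = cumsum)
# [1] 1 2 4 6   (group "a": 1, 1+3; group "b": 2, 2+4)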
Answer 2:
dplyr would make this very easy:
library(dplyr)
df %>%
  group_by(subject) %>%
  arrange(month) %>%
  mutate(cumulative = cumsum(measurement))
Source: local data frame [10 x 4]
Groups: subject
measurement subject month cumulative
1 0.7050648 cat 1 0.7050648
2 0.6569923 cat 2 1.3620571
3 0.1346666 cat 3 1.4967237
4 0.7365883 cat 4 2.2333120
5 0.5190959 cat 5 2.7524079
6 0.6417455 dog 1 0.6417455
7 0.8304476 dog 2 1.4721931
8 0.2861395 dog 3 1.7583327
9 0.9370754 dog 4 2.6954081
10 0.9148060 dog 5 3.6102141
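Note that this returns the rows sorted by month within subject rather than in the original row order. If the original order matters, one possible variant (a sketch of my own; the .row helper column is not part of the answer) records the row number first and restores it afterwards:
df %>%
  mutate(.row = row_number()) %>%              # remember original positions
  arrange(subject, month) %>%                  # sort for the cumulative sum
  group_by(subject) %>%
  mutate(cumulative = cumsum(measurement)) %>%
  ungroup() %>%
  arrange(.row) %>%                            # restore the original order
  select(-.row)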
Although if you are looking for absolute performance, you probably want to use data.table:
library(data.table)
# setDT() converts df to a data.table in place; := adds the column by reference
setDT(df)[order(month), cumulative := cumsum(measurement), by = subject]
# measurement subject month cumulative
# 1: 0.7050648 cat 1 0.7050648
# 2: 0.6569923 cat 2 1.3620571
# 3: 0.1346666 cat 3 1.4967237
# 4: 0.7365883 cat 4 2.2333120
# 5: 0.5190959 cat 5 2.7524079
# 6: 0.6417455 dog 1 0.6417455
# 7: 0.8304476 dog 2 1.4721931
# 8: 0.2861395 dog 3 1.7583327
# 9: 0.9370754 dog 4 2.6954081
# 10: 0.9148060 dog 5 3.6102141
Answer 3:
Rather than using a custom function, why not use the built-in R functions by and cumsum?
df <- df[order(df$subject, df$month), ]
df <- cbind(df,
            cumulative = do.call(what = c,
                                 args = by(data = df$measurement,
                                           INDICES = df$subject,
                                           FUN = cumsum)))
print(df)
measurement subject month cumulative
10 0.7050648 cat 1 0.7050648
9 0.6569923 cat 2 1.3620571
8 0.1346666 cat 3 1.4967237
7 0.7365883 cat 4 2.2333120
6 0.5190959 cat 5 2.7524079
5 0.6417455 dog 1 0.6417455
4 0.8304476 dog 2 1.4721931
3 0.2861395 dog 3 1.7583327
2 0.9370754 dog 4 2.6954081
1 0.9148060 dog 5 3.6102141
cumsum creates cumulative sums, and by lets you do by-group processing, returning a list (an alternative is aggregate, which gives you a data frame). As long as the data is ordered correctly, this gives you the right result.
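To make the aggregate alternative concrete (a sketch of my own, not from the answer): because cumsum returns one value per input row, aggregate packs each group's result into a matrix column rather than one row per observation.
# aggregate() groups by subject; since cumsum() returns a whole vector
# per group, the measurement column of the result is a matrix:
agg <- aggregate(measurement ~ subject, data = df, FUN = cumsum)
str(agg)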
Answer 4:
This function takes a vector of measurements and a vector of months, figures out how to order the data by month, then calculates the cumulative sum of the measurements ordered by month, returning the result in the original order (using the fact that (x[o])[order(o)] == x):
FUN <- function(measure, month) {
  o <- order(month)              # permutation that sorts by month
  cumsum(measure[o])[order(o)]   # accumulate, then undo the permutation
}
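As a quick sanity check of that identity (my own toy example): order(o) is the inverse of the permutation o, so indexing by it undoes the reordering.
x <- c(10, 20, 30)
o <- c(3, 1, 2)
identical((x[o])[order(o)], x)  # TRUE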
So if you were to split your measurement and month vectors into lists based on subject, you could map each pair of elements from old values to new:
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
Provided the implied 'geometry' is consistent, split()<- does the book-keeping to assign the list of values back to their correct locations in a vector:
df$cumulative <- NA_real_  # or add this column to df's construction
split(df$cumulative, df$subject) <-
  Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
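A toy illustration of that book-keeping (my own example): values assigned to each group land back in the positions that group occupies in the original vector.
v <- rep(NA_real_, 4)
g <- c("a", "b", "a", "b")
# group "a" occupies positions 1 and 3; group "b" positions 2 and 4
split(v, g) <- list(a = c(1, 2), b = c(3, 4))
v  # [1] 1 3 2 4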
I think so far this is one of the few solutions that preserve the original order of the data (presumably steps could be added to the other solutions...).
This seems to scale linearly, at least as the number of rows gets large:
f0 <- function(df) {
  split(df$cumulative, df$subject) <-
    Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
  df
}
df <- lapply(10^(3:6), function(datalength) {
  data.frame(measurement = runif(datalength),
             subject = rep(c("dog", "cat"), each = datalength / 2),
             month = rep(seq(datalength / 2, 1, by = -1)),
             cumulative = rep(NA_real_, datalength))
})
library(microbenchmark)
and then
> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
        expr        min          lq        mean      median          uq        max neval
 f0(df[[1]])    503.076    523.5275    576.4077    574.7825    612.9585    913.945   100
 f0(df[[2]])   2701.103   2769.3830   2869.0045   2847.1190   2922.0120   3296.594   100
 f0(df[[3]])  26673.878  27184.7980  27894.5087  27547.5595  28595.6775  35015.903   100
 f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995 342556.407   100
Source: https://stackoverflow.com/questions/29128275/how-do-i-optimize-sapply-in-r-to-calculate-running-totals-on-a-dataframe