Question
I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows much faster than linearly as the dataset gets larger. I'm a novice R programmer; can you help me make this more efficient?
The function and how I invoke it:
accumulate <- function(recordnum, df) {
  # Flag rows for the same subject with a month at or before this record's month
  sumthese <- (df$subject == df$subject[recordnum]) &
    (df$month <= df$month[recordnum])
  sum(df$measurement[sumthese])
}
set.seed(42)
datalength <- 10
df <- data.frame(measurement = runif(datalength),
                 subject = rep(c("dog", "cat"), each = datalength / 2),
                 month = rep(seq(datalength / 2, 1, by = -1)))
system.time(df$cumulative <- sapply(1:datalength, accumulate, df))
The input data frame:
> df
measurement subject month
1 0.4577418 dog 5
2 0.7191123 dog 4
3 0.9346722 dog 3
4 0.2554288 dog 2
5 0.4622928 dog 1
6 0.9400145 cat 5
7 0.9782264 cat 4
8 0.1174874 cat 3
9 0.4749971 cat 2
10 0.5603327 cat 1
The output data frame:
> df
measurement subject month cumulative
1 0.9148060 dog 5 3.6102141
2 0.9370754 dog 4 2.6954081
3 0.2861395 dog 3 1.7583327
4 0.8304476 dog 2 1.4721931
5 0.6417455 dog 1 0.6417455
6 0.5190959 cat 5 2.7524079
7 0.7365883 cat 4 2.2333120
8 0.1346666 cat 3 1.4967237
9 0.6569923 cat 2 1.3620571
10 0.7050648 cat 1 0.7050648
Notice that the cumulative column shows the accumulation of all measurements up to and including the current month. The function does not require the data frame to be sorted. With datalength = 100 the elapsed time is 0.3 s; at 1,000 it is 0.58 s; at 10,000 it is 27.72 s. I need this to run on 200K+ records.
Thanks!
Answer 1:
This is non-destructive, i.e. the original df is not modified. No packages are used. The original order of the rows of df is preserved; however, if that is not important, the [order(o), ] on the last line can be omitted.
# Sort by subject then month, accumulate within subject, then restore the order
o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]
giving:
measurement subject month cumulative
1 0.37955924 dog 5 2.2580530
2 0.43577158 dog 4 1.8784938
3 0.03743103 dog 3 1.4427222
4 0.97353991 dog 2 1.4052912
5 0.43175125 dog 1 0.4317512
6 0.95757660 cat 5 4.0751151
7 0.88775491 cat 4 3.1175385
8 0.63997877 cat 3 2.2297836
9 0.97096661 cat 2 1.5898048
10 0.61883821 cat 1 0.6188382
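For reference, ave() is the base R helper doing the work here: it applies FUN within each group and returns the results aligned with the input positions. A minimal standalone illustration (my own toy example, not from the answer):
ave(c(1, 2, 3, 4), c("a", "b", "a", "b"), FUN = cumsum)
# [1] 1 2 4 6   (group "a": 1, 1+3; group "b": 2, 2+4)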
Answer 2:
dplyr would make this very easy:
library(dplyr)
df %>%
  group_by(subject) %>%
  arrange(month) %>%
  mutate(cumulative = cumsum(measurement))
Source: local data frame [10 x 4]
Groups: subject
measurement subject month cumulative
1 0.7050648 cat 1 0.7050648
2 0.6569923 cat 2 1.3620571
3 0.1346666 cat 3 1.4967237
4 0.7365883 cat 4 2.2333120
5 0.5190959 cat 5 2.7524079
6 0.6417455 dog 1 0.6417455
7 0.8304476 dog 2 1.4721931
8 0.2861395 dog 3 1.7583327
9 0.9370754 dog 4 2.6954081
10 0.9148060 dog 5 3.6102141
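Note that this returns the rows sorted by month within subject rather than in the original row order. If the original order matters, one possible variant (a sketch of my own; the .row helper column is not part of the answer) records the row number first and restores it afterwards:
df %>%
  mutate(.row = row_number()) %>%              # remember original positions
  arrange(subject, month) %>%                  # sort for the cumulative sum
  group_by(subject) %>%
  mutate(cumulative = cumsum(measurement)) %>%
  ungroup() %>%
  arrange(.row) %>%                            # restore the original order
  select(-.row)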
Although if you are looking for absolute performance, you probably want to use data.table:
library(data.table)
# setDT() converts df to a data.table in place; := adds the column by reference
setDT(df)[order(month), cumulative := cumsum(measurement), by = subject]
# measurement subject month cumulative
# 1: 0.7050648 cat 1 0.7050648
# 2: 0.6569923 cat 2 1.3620571
# 3: 0.1346666 cat 3 1.4967237
# 4: 0.7365883 cat 4 2.2333120
# 5: 0.5190959 cat 5 2.7524079
# 6: 0.6417455 dog 1 0.6417455
# 7: 0.8304476 dog 2 1.4721931
# 8: 0.2861395 dog 3 1.7583327
# 9: 0.9370754 dog 4 2.6954081
# 10: 0.9148060 dog 5 3.6102141
Answer 3:
Rather than using a custom function, why not use the built-in R functions by and cumsum?
df <- df[order(df$subject, df$month), ]
df <- cbind(df,
            cumulative = do.call(what = c,
                                 args = by(data = df$measurement,
                                           INDICES = df$subject,
                                           FUN = cumsum)))
print(df)
measurement subject month cumulative
10 0.7050648 cat 1 0.7050648
9 0.6569923 cat 2 1.3620571
8 0.1346666 cat 3 1.4967237
7 0.7365883 cat 4 2.2333120
6 0.5190959 cat 5 2.7524079
5 0.6417455 dog 1 0.6417455
4 0.8304476 dog 2 1.4721931
3 0.2861395 dog 3 1.7583327
2 0.9370754 dog 4 2.6954081
1 0.9148060 dog 5 3.6102141
cumsum creates cumulative sums, and by lets you do by-group processing, returning a list (an alternative is aggregate, which gives you a data frame). As long as the data is ordered correctly, this gives you the right result.
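To make the aggregate alternative concrete (a sketch of my own, not from the answer): because cumsum returns one value per input row, aggregate packs each group's result into a matrix column rather than one row per observation.
# aggregate() groups by subject; since cumsum() returns a whole vector
# per group, the measurement column of the result is a matrix:
agg <- aggregate(measurement ~ subject, data = df, FUN = cumsum)
str(agg)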
Answer 4:
This function takes a vector of measurements and a vector of months, figures out how to order the data by month, then calculates the cumulative sum of the measurements ordered by month, returning the result in the original order (using the fact that (x[o])[order(o)] == x):
FUN <- function(measure, month) {
  o <- order(month)              # permutation that sorts by month
  cumsum(measure[o])[order(o)]   # accumulate, then undo the permutation
}
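As a quick sanity check of that identity (my own toy example): order(o) is the inverse of the permutation o, so indexing by it undoes the reordering.
x <- c(10, 20, 30)
o <- c(3, 1, 2)
identical((x[o])[order(o)], x)  # TRUE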
So if you were to split your measurement and month vectors into lists based on subject, you could map each pair of elements from old values to new:
Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
Provided the implied 'geometry' is consistent, split()<- does the book-keeping to assign the list of values back to their correct locations in a vector:
df$cumulative <- NA_real_  # or add this column to df's construction
split(df$cumulative, df$subject) <-
  Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
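A toy illustration of that book-keeping (my own example): values assigned to each group land back in the positions that group occupies in the original vector.
v <- rep(NA_real_, 4)
g <- c("a", "b", "a", "b")
# group "a" occupies positions 1 and 3; group "b" positions 2 and 4
split(v, g) <- list(a = c(1, 2), b = c(3, 4))
v  # [1] 1 3 2 4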
I think so far this is one of the few solutions that preserve the original order of the data (presumably steps could be added to the other solutions...).
This seems to scale linearly, at least as the number of rows gets large:
f0 <- function(df) {
  split(df$cumulative, df$subject) <-
    Map(FUN, split(df$measurement, df$subject), split(df$month, df$subject))
  df
}
df <- lapply(10^(3:6), function(datalength) {
  data.frame(measurement = runif(datalength),
             subject = rep(c("dog", "cat"), each = datalength / 2),
             month = rep(seq(datalength / 2, 1, by = -1)),
             cumulative = rep(NA_real_, datalength))
})
library(microbenchmark)
and then
> microbenchmark(f0(df[[1]]), f0(df[[2]]), f0(df[[3]]), f0(df[[4]]))
Unit: microseconds
        expr        min          lq        mean      median          uq        max neval
 f0(df[[1]])    503.076    523.5275    576.4077    574.7825    612.9585    913.945   100
 f0(df[[2]])   2701.103   2769.3830   2869.0045   2847.1190   2922.0120   3296.594   100
 f0(df[[3]])  26673.878  27184.7980  27894.5087  27547.5595  28595.6775  35015.903   100
 f0(df[[4]]) 283416.456 285104.5225 292142.5274 290043.3785 295415.6995 342556.407   100
Source: https://stackoverflow.com/questions/29128275/how-do-i-optimize-sapply-in-r-to-calculate-running-totals-on-a-dataframe