Question
I am trying to implement rounding over a column in such a way that the running sum of the rounded values matches the running sum of the original values within a group.
Sample data for the task has three columns:
- numbers - the values that I need to round;
- ids - defines the order of the values (could be a date for time-series data);
- group - defines the group within which the numbers must be rounded.
Here is a data sample, already ordered by ids within each group:
numbers ids group
35.07209 1 1
27.50931 2 1
70.62019 3 1
99.55451 6 1
34.40472 8 1
17.58864 10 1
93.66178 4 3
83.21700 5 3
63.89058 7 3
88.96561 9 3
To generate sample data for testing I use this code:
# Make a data sample: 1M rows, values in [0, 100), three unevenly sized groups.
x.size <- 10^6
x <- list("numbers" = runif(x.size) * 100,
          "ids"     = 1:x.size,
          "group"   = ifelse(runif(x.size) > 0.2, 1, ifelse(runif(x.size) > 0.8, 2, 3)))
x <- data.frame(x)
x <- x[order(x$group), ]
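The benchmark code further down also runs the same functions on a data.table copy of x, referred to as x.dt. A minimal setup for it (assuming the data.table package is installed):
library(data.table)
x.dt <- as.data.table(x)  # same columns (numbers, ids, group), already ordered by group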
I wrote a function that keeps the state of the rounding within a group, to make sure that the total of the rounded values stays correct:
makeRunRound <- function() {
  # Data must be sorted by ids within each group.
  cumDiff <- 0
  savedId <- 0
  function(x, id) {
    # id here represents the group.
    if (id != savedId) {
      # New group: reset the accumulated rounding error.
      cumDiff <<- 0
      savedId <<- id
    }
    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff
    if (cumDiff > 1) {
      # Accumulated error exceeds 1: push it into the current value.
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return(xInt)
  }
}
runRound <- makeRunRound()
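Applied row by row via mapply (as in the benchmark below), the closure carries the rounding error forward within each group; the run.round column name here is just illustrative:
x$run.round <- mapply(FUN = runRound, x$numbers, x$group)
# For group 1 in the sample above, 35.07, 27.51, 70.62 become 35, 27, 71:
# the fractional parts accumulate past 1 at the third value, which absorbs the +1.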
This approach works, and I would be happy with it if not for the speed. It takes 2-3 seconds to complete the running rounding on a 1M-record sample. This is too long for me, and there is another way, explained in this question, which works six times faster. I keep the code as given in the answer by josliber:
smartRound <- function(x) {
  y <- floor(x)
  # Add 1 to the entries with the largest fractional parts,
  # until the grand total matches round(sum(x)).
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}
Using the sample data generated by the code above, I benchmark the approaches:
# Code to benchmark speed.
library(microbenchmark)
res <- microbenchmark(
  "run.df"   = x$mrounded <- mapply(FUN = runRound, x$numbers, x$group),
  "run.dt"   = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  "smart.df" = x$smart.round <- smartRound(x$numbers),
  "smart.dt" = smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "silly"    = x$silly.round <- round(x$numbers),
  times = 50
)
print(res)
boxplot(res)
This produces the following results:
Unit: milliseconds
expr min lq mean median uq max neval
run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679 50
run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113 50
smart.df 488.70854 537.03179 576.57704 567.63077 611.81271 861.76436 50
smart.dt 390.35646 414.96749 468.95317 457.85820 507.54395 631.17081 50
silly 13.72486 15.82744 19.41796 17.19057 18.85385 88.06329 50
So, the timing goes from roughly 20 ms for plain cell-level rounding to roughly 2.6 s for the method that respects the running total of the rounded values within each group.
I have included a comparison of the calculations based on data.frame and data.table to demonstrate that there is no major difference between the two, even though data.table slightly improves performance.
I really appreciate the simplicity and speed of smartRound, but it does not respect the order of the items, so its result differs from what I need.
Is there a way to:
- either modify smartRound so that it achieves the same results as runRound without losing the performance?
- or modify runRound to improve its performance?
- or is there another, better solution altogether?
EDIT:
dww's answer gives the fastest solution:
diffRound <- function(x) {
  # Round the running total, then take differences to recover per-item values.
  diff(c(0, round(cumsum(x))))
}
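By construction, cumsum(diffRound(x)) equals round(cumsum(x)) exactly, so every running total of the rounded values stays within 0.5 of the true running total. A quick sanity check:
all(cumsum(diffRound(x$numbers)) == round(cumsum(x$numbers)))
# TRUE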
I have reduced the test to four options:
res <- microbenchmark(
  "silly"     = x$silly.round <- round(x$numbers),
  "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)],
  "smart.dt"  = smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "run.dt"    = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  times = 50
)
New results:
Unit: milliseconds
expr min lq mean median uq max neval
silly 14.67823 16.64882 17.31416 16.83338 17.67497 22.48689 50
diff(dww) 54.57762 70.11553 76.67135 71.37325 76.83717 139.18745 50
smart.dt 392.83240 408.65768 456.46592 441.33212 492.67824 592.57723 50
run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005 50
Thanks to dww, I get a 6x performance gain without losing precision.
Answer 1:
I would do it this way, with simple vectorised base functions:
First calculate the running total of the original numbers, and the rounded value of that running total. Then find a list of numbers that adds up to this rounded running total, using diff() to see by how much each rounded sum is larger than the last.
cum.sum <- cumsum(x$numbers)             # running total of the originals
cum.sum.rounded <- round(cum.sum)        # rounded running total
numbers.round <- diff(cum.sum.rounded)   # per-item rounded values...
numbers.round <- c(cum.sum.rounded[1], numbers.round)  # ...prepending the first
Check that all is as you want it:
check.cs <- cumsum(numbers.round)
all(abs(check.cs - cum.sum) <= 1)
# TRUE
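To get the same behaviour within each group rather than over the whole column, the diff-of-rounded-cumsum logic just needs a grouped call; a minimal sketch using base R's ave (the rounded column name is illustrative):
x$rounded <- ave(x$numbers, x$group,
                 FUN = function(v) diff(c(0, round(cumsum(v)))))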
Source: https://stackoverflow.com/questions/36969907/running-rounding