Running Rounding

Question


I am trying to implement rounding over a column in such a way that the running sum of the rounded values matches the running sum of the original values within a group.
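A tiny illustration of the goal (toy values, not the sample data below): plain round() lets the running sum of the rounded values drift away from the running sum of the originals, which is exactly what I need to avoid:

v <- c(0.6, 0.6, 0.6)
cumsum(round(v))  # 1 2 3 -- naive rounding overshoots the running total
cumsum(v)         # 0.6 1.2 1.8 -- the rounded running sum should stay close to this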

Sample data for the task has three columns:

  • numbers - values that I need to round;
  • ids - define the order of the values; could be a date for time-series data;
  • group - defines the group within which I need to round the numbers.

Here is a data sample, already ordered by ids within a group:

       numbers  ids group
       35.07209 1   1
       27.50931 2   1
       70.62019 3   1
       99.55451 6   1
       34.40472 8   1
       17.58864 10  1
       93.66178 4   3
       83.21700 5   3
       63.89058 7   3
       88.96561 9   3

To generate sample data for testing I use this code:

  # Make data sample.
  x.size <- 10^6
  x <- list(
    "numbers" = runif(x.size) * 100,
    "ids" = 1:x.size,
    "group" = ifelse(runif(x.size) > 0.2, 1, ifelse(runif(x.size) > 0.8, 2, 3))
  )
  x <- data.frame(x)
  x <- x[order(x$group), ]

I wrote a function that keeps the state of the rounding within a group, to make sure that the total of the rounded values is correct:

makeRunRound <- function() {
  # Data must be sorted by id.
  cumDiff <- 0
  savedId <- 0

  function(x, id) {
    # id here represents the group.

    # Reset the carried-over difference when a new group starts.
    if(id != savedId) {
      cumDiff <<- 0
      savedId <<- id
    }

    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff

    # Once the accumulated fractions exceed 1, emit the whole unit(s).
    if(cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return(xInt)
  }
}

runRound <- makeRunRound()
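As a quick illustration of the statefulness (toy values; using a fresh instance so the closure's state starts clean), the leftover fraction is carried from call to call:

rr <- makeRunRound()
rr(0.6, 1)  # 0 -- cumDiff is now 0.6
rr(0.6, 1)  # 1 -- cumDiff reached 1.2, so one whole unit is emitted
rr(0.6, 1)  # 0 -- cumDiff is back down to 0.8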

This approach works, and I would be happy with it if it were not for the speed.

It takes 2-3 seconds to complete the running rounding on a 1M-record sample.

This is too long for me, and there is another approach, explained in this question, that works six times faster. I keep the code as given in the answer by josliber:

smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x-y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}
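For reference, the way smartRound works: it floors every value, computes how many whole units the floored vector falls short of the rounded grand total, and hands those units to the entries with the largest fractional parts. A toy illustration (made-up values):

# Floor everything, then give the missing units to the largest fractions.
smartRound(c(0.4, 0.7, 0.9))  # 0 1 1 -- grand total of 2 preserved exactly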

Benchmarking with the sample data generated by the code above:

# Code to benchmark speed.
library(microbenchmark)
library(data.table)
x.dt <- as.data.table(x)  # data.table copy of the sample, used by the *.dt variants
res <- microbenchmark(
  "run.df" = x$mrounded <- mapply(FUN = runRound, x$numbers, x$group),
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  "smart.df" = x$smart.round <- smartRound(x$numbers),
  "smart.dt" = smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "silly" = x$silly.round <- round(x$numbers),
  times = 50
)
print(res)
boxplot(res)

This produces the following results:

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
   run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679    50
   run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113    50
 smart.df  488.70854  537.03179  576.57704  567.63077  611.81271  861.76436    50
 smart.dt  390.35646  414.96749  468.95317  457.85820  507.54395  631.17081    50
    silly   13.72486   15.82744   19.41796   17.19057   18.85385   88.06329    50

So, the speed ranges from ~20 ms for naive cell-level rounding to ~2.6 s for the method that respects the running total of the rounded values within the group.

I have included a comparison of the data.frame- and data.table-based calculations to demonstrate that there is no major difference, even though data.table slightly improves performance.

I really appreciate the simplicity and the speed of smartRound, but it does not respect the order of the items, hence the result will differ from what I need.
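A toy case (made-up values) makes the problem visible: smartRound only guarantees the grand total, so the extra units can land anywhere in the vector and the intermediate running sums drift:

x0 <- rep(0.4, 5)                    # grand total 2
smartRound(x0)                       # 0 0 0 1 1 -- with equal fractions, the last entries win
cumsum(smartRound(x0)) - cumsum(x0)  # -0.4 -0.8 -1.2 -0.6 0: drifts past 1 mid-series

runRound, by contrast, never lets the running difference exceed 1, which is the property I need.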

Is there a way to:

  • either, modify smartRound in a way that achieves the same results as runRound without losing the performance?
  • or, modify runRound to improve performance?
  • or, is there another, better solution altogether?

EDIT:

dww's answer gives the fastest solution:

diffRound <- function(x) { 
  diff(c(0, round(cumsum(x)))) 
}
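This is fast because it is fully vectorised, and it is also tight on precision: the running sum of its output is exactly round(cumsum(x)), so it can never drift more than 0.5 from the original running total. Checking it on the same toy vector as above:

x0 <- rep(0.4, 5)
diffRound(x0)                                 # 0 1 0 1 0
max(abs(cumsum(diffRound(x0)) - cumsum(x0)))  # 0.4 -- bounded by 0.5 by construction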

I have reduced the test to four options:

res <- microbenchmark(
  "silly" = x$silly.round <- round(x$numbers),
  "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)] ,
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  times = 50
)

New results:

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval
     silly   14.67823   16.64882   17.31416   16.83338   17.67497   22.48689    50
 diff(dww)   54.57762   70.11553   76.67135   71.37325   76.83717  139.18745    50
  smart.dt  392.83240  408.65768  456.46592  441.33212  492.67824  592.57723    50
    run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005    50

Thanks to dww, I have a 6x performance gain without losing precision.


Answer 1:


I would do it this way, with simple base vectorised functions:

First calculate the running total of the original numbers, and the rounded value of that running total. Then recover a sequence of numbers that adds up to this rounded running total, using diff() to see how much each rounded running sum is larger than the last.

cum.sum <- cumsum(x$numbers)
cum.sum.rounded <- round(cum.sum)
numbers.round <- diff(cum.sum.rounded)
numbers.round <- c(cum.sum.rounded[1], numbers.round)

Check that all is as you want it:

check.cs <- cumsum(numbers.round)
all(abs(check.cs - cum.sum) <= 1)
# TRUE


Source: https://stackoverflow.com/questions/36969907/running-rounding
