Question
I have an assignment problem where I'm trying to allocate a number of tasks with a known expected cost (runtime in seconds) to X parallel workers, subject to the constraint that each worker receives the same number of tasks (save for remainders), so that the total expected runtime per worker is roughly equal.
I'm using a data frame that defines the tasks to be executed, and for each task I can calculate a pretty accurate expected cost (runtime in seconds). E.g. something like this:
library("tibble")
set.seed(1232)
tasks <- tibble(task = 1:20, cost = runif(20, min = 1, max = 5)^2)
head(tasks)
#> # A tibble: 6 x 2
#> task cost
#> <int> <dbl>
#> 1 1 22.5
#> 2 2 20.0
#> 3 3 21.3
#> 4 4 8.13
#> 5 5 18.3
#> 6 6 19.6
Created on 2019-11-21 by the reprex package (v0.3.0)
This is then used with foreach::foreach(...) %dopar% ... to execute the tasks in parallel. foreach() splits the tasks into roughly equal-sized groups of size nrow(tasks)/X, where X is the number of parallel workers (cores).
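For reference, a minimal sketch of how such a loop might look with a doParallel backend (the backend registration and the run_task() helper are illustrative assumptions, not code from this post):

library("foreach")
library("doParallel")

registerDoParallel(cores = X)  # X parallel workers

# One iteration per task row; as described above, the backend divides these
# iterations among the X workers in roughly equal-sized groups.
# run_task() is a hypothetical function that executes a single task.
results <- foreach(i = seq_len(nrow(tasks)), .combine = c) %dopar% {
  run_task(tasks$task[i])
}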
I'm currently shuffling the task list so that the cost is roughly equal for each worker, but there can still be substantial deviations, i.e. some workers get finished much earlier than others and thus it would have been better if they had had some more costly tasks. E.g.:
# shuffle tasks (in the original application cost is not random initially)
tasks <- tasks[sample(1:nrow(tasks)), ]
# number of workers
X <- 4
tasks$worker <- rep(1:X, each = nrow(tasks)/X)
# expected total cost (runtime in s) per worker
sapply(split(tasks$cost, tasks$worker), sum)
#> 1 2 3 4
#> 77.25278 35.25026 66.09959 64.05435
Created on 2019-11-21 by the reprex package (v0.3.0)
The second worker finishes in half the time of the other workers, so its capacity is wasted and the whole job takes longer to finish.
What I'd like to do instead is have a way of re-ordering the task data frame so that when foreach splits it into X groups, the total expected cost per group is more even.
I imagine this is a super-well known kind of problem and I just don't know the right verbiage to google (nor how to do it in R). Thanks for any help.
(EDIT) Mostly better alternative
For now, a relatively simple alternative that seems to do better than random shuffling. This orders the tasks by cost, assigns the first X tasks to workers 1 to X, then assigns the next chunk of X tasks in reverse order to workers X to 1, etc (this is "alt1" below).
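A minimal sketch of this ordering, applied to the tasks tibble from above; the final sort by worker is an added step here, on the assumption that handing foreach the rows in this order makes its contiguous split line up with the assignment:

# sort by cost, then deal tasks out to workers 1..X, X..1, 1..X, ... ("zig-zag")
tasks <- tasks[order(tasks$cost), ]
tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
# group rows by worker so that a contiguous split matches the assignment
tasks <- tasks[order(tasks$worker), ]
# expected total cost (runtime in s) per worker
sapply(split(tasks$cost, tasks$worker), sum)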
(EDIT2) Added the RcppAlgos method by Joseph Wood (see his answer below).
library("tibble")
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")
library("RcppAlgos")
getPartitions <- function(df, nWorkers, tol_ratio = 0.0001) {
  nSections <- nrow(df) / nWorkers
  avg <- sum(df$cost) / nWorkers
  tol <- avg * tol_ratio
  vec <- df$cost
  cond <- TRUE
  part <- list()

  for (i in 1:(nWorkers - 1)) {
    while (cond) {
      vals <- comboGeneral(vec, nSections,
                           constraintFun = "sum",
                           comparisonFun = "==",
                           limitConstraints = avg + (tol / 2),
                           tolerance = tol,
                           upper = 1)
      cond <- nrow(vals) == 0
      if (cond) {
        tol <- tol * 2
      } else {
        v <- match(as.vector(vals), df$cost)
      }
    }
    part[[i]] <- v
    vec <- df$cost[-(do.call(c, part))]
    avg <- sum(vec) / (nWorkers - i)
    tol <- avg * tol_ratio
    cond <- TRUE
  }
  part[[nWorkers]] <- which(!1:nrow(df) %in% do.call(c, part))
  part
}
race <- function() {
  N_TASKS = 100
  X = 4
  tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)

  # random shuffle
  tasks$worker <- rep(1:X, each = nrow(tasks)/X)
  rando <- max(sapply(split(tasks$cost, tasks$worker), sum))

  # alternative 1
  tasks <- tasks[order(tasks$cost), ]
  tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
  alt1 <- max(sapply(split(tasks$cost, tasks$worker), sum))

  # modified version of ivan100sic's answer
  # sort by descending cost, after initial allocation, allocate costly tasks
  # first to workers with lowest total cost so far
  group <- factor(rep(1:(ceiling(nrow(tasks)/4)), each = X))
  tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
  tasks$worker <- c(1:X, rep(NA, length.out = nrow(tasks) - X))
  task_sets <- split(tasks, group)
  task_sets[[1]]$worker <- 1:X
  for (i in 2:length(task_sets)) {
    # get current total cost by worker
    total <- task_sets %>%
      bind_rows() %>%
      filter(!is.na(worker)) %>%
      group_by(worker) %>%
      summarize(cost = sum(cost)) %>%
      arrange(cost)
    task_sets[[i]]$worker <- total[["worker"]]
  }
  tasks <- bind_rows(task_sets)
  alt2 <- max(sapply(split(tasks$cost, tasks$worker), sum))

  # RcppAlgos method by Joseph Wood below
  setParts <- getPartitions(tasks, X)
  worker <- rep(1:4, each = N_TASKS/X)
  row_num <- unsplit(setParts, worker)
  tasks$worker <- worker[order(row_num)]
  algo <- max(sapply(split(tasks$cost, tasks$worker), sum))

  c(ref = sum(tasks$cost) / X, rando = rando, alt1 = alt1, alt2 = alt2, algo = algo)
}
set.seed(24332)
sims <- replicate(1e3, race())
sims <- sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(rando:algo, names_to = "Method")

ggplot(sims, aes(x = value, color = Method)) +
  geom_density() +
  scale_x_continuous(limits = c(0, max(sims$value))) +
  labs(x = "Total runtime (s)")
# this shows the estimated runtime relative to average total cost
# per worker (which may be unobtainable)
sims %>%
  group_by(Method) %>%
  summarize(time_relative_to_ref = mean(value - ref)) %>%
  arrange(time_relative_to_ref)
#> # A tibble: 4 x 2
#> Method time_relative_to_ref
#> <chr> <dbl>
#> 1 algo 0.0817
#> 2 alt2 0.307
#> 3 alt1 4.97
#> 4 rando 154.
Created on 2020-02-04 by the reprex package (v0.3.0)
- "rando": randomly shuffle the task list
- "alt1": sort tasks by cost and alternate assigning to worker 1 to X, X to 1, etc.
- "alt2": based on ivan100sic's answer below, after the first allocation to workers 1 to X, allocate based on total cost per worker so far
- "algo": based on Joseph Woods's answer below
Answer 1:
As @JohnColeman points out, this essentially boils down to partitioning. We are trying to partition the tasks equally such that the sum of the cost doesn't vary wildly.
The algorithm below does just that. The main idea is to successively find a set of tasks whose sum is close to the average. Once we find one, we remove them, and continue selecting.
The workhorse of the algorithm below is comboGeneral from RcppAlgos*. This function allows one to find combinations of a vector that meet a constraint. In this case, we are looking for 5 numbers whose sum is close to sum(tasks$cost) / (number of workers) ~ 60.66425. Since we are looking for numbers whose sum is close to, not exactly equal to, this target, we can bound our constraint. That is, we can look for combinations such that the sum is within a given tolerance.
library(RcppAlgos)
getPartitions <- function(df, nWorkers, tol_ratio = 0.0001) {
  nSections <- nrow(df) / nWorkers
  avg <- sum(df$cost) / nWorkers
  tol <- avg * tol_ratio
  vec <- df$cost
  cond <- TRUE
  part <- list()

  for (i in 1:(nWorkers - 1)) {
    while (cond) {
      vals <- comboGeneral(vec, nSections,
                           constraintFun = "sum",
                           comparisonFun = "==",
                           limitConstraints = avg + (tol / 2),
                           tolerance = tol,
                           upper = 1)
      cond <- nrow(vals) == 0
      if (cond) {
        tol <- tol * 2
      } else {
        v <- match(as.vector(vals), df$cost)
      }
    }
    part[[i]] <- v
    vec <- df$cost[-(do.call(c, part))]
    avg <- sum(vec) / (nWorkers - i)
    tol <- avg * tol_ratio
    cond <- TRUE
  }
  part[[nWorkers]] <- which(!1:nrow(df) %in% do.call(c, part))
  part
}
The output for the example given by the OP is as follows:
getPartitions(tasks, 4)
[[1]]
[1] 11 13 8 14 10
[[2]]
[1] 12 4 20 2 16
[[3]]
[1] 19 9 18 5 6
[[4]]
[1] 1 3 7 15 17
These are the rows from tasks that are to be passed to each worker. It runs instantly and returns a pretty even workload. Here are the estimated times for each worker:
sapply(getPartitions(tasks, 4), function(x) {
  sum(tasks$cost[x])
})
[1] 60.67292 60.66552 60.80399 60.51455
This is pretty good given that the ideal time would be mean(tasks$cost) * 5 ~= 60.66425.
Let's see how it performs. Below is a modified script for plotting that takes into account how varied each result is for a given method. We measure this with sd (standard deviation). It also returns the ideal solution for reference.
library("tibble")
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")
race <- function() {
  N_TASKS = 100
  X = 4
  tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)
  ideal_soln <- sum(tasks$cost) / X

  # random shuffle
  tasks$worker <- rep(1:X, each = nrow(tasks)/X)
  rando_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
  rando_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))

  # alternative 1
  tasks <- tasks[order(tasks$cost), ]
  tasks$worker <- rep(c(1:X, X:1), length.out = nrow(tasks))
  alt1_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
  alt1_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))

  # modified version of ivan100sic's answer
  # sort by descending cost, after initial allocation, allocate costly tasks
  # first to workers with lowest total cost so far
  group <- factor(rep(1:(ceiling(nrow(tasks)/4)), each = X))
  tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
  tasks$worker <- c(1:X, rep(NA, length.out = nrow(tasks) - X))
  task_sets <- split(tasks, group)
  task_sets[[1]]$worker <- 1:X
  for (i in 2:length(task_sets)) {
    # get current total cost by worker
    total <- task_sets %>%
      bind_rows() %>%
      filter(!is.na(worker)) %>%
      group_by(worker) %>%
      summarize(cost = sum(cost)) %>%
      arrange(cost)
    task_sets[[i]]$worker <- total[["worker"]]
  }
  tasks <- bind_rows(task_sets)
  alt2_mx <- max(sapply(split(tasks$cost, tasks$worker), sum))
  alt2_sd <- sd(sapply(split(tasks$cost, tasks$worker), sum))

  ## RcppAlgos solution
  setParts <- getPartitions(tasks, X)
  algos_mx <- max(sapply(setParts, function(x) sum(tasks$cost[x])))
  algos_sd <- sd(sapply(setParts, function(x) sum(tasks$cost[x])))

  c(target_soln = ideal_soln, rando_max = rando_mx, alt1_max = alt1_mx,
    alt2_max = alt2_mx, algos_max = algos_mx, rando_std_dev = rando_sd,
    alt1_std_dev = alt1_sd, alt2_std_dev = alt2_sd, algos_std_dev = algos_sd)
}
set.seed(24332)
system.time(sims <- replicate(1e3, race()))
sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(rando_std_dev:algos_std_dev, names_to = "Method") %>%
  ggplot(aes(x = value, color = Method)) +
  geom_density() +
  scale_x_continuous(limits = c(0, 100)) +
  labs(x = "Standard Deviation (s)")
Warning message:
Removed 719 rows containing non-finite values (stat_density).
It is hard to tell what is going on because the standard deviation for the rando method is so large. If we just look at alt1, alt2, and the algos approach, we have:
sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(alt1_std_dev:algos_std_dev, names_to = "Method") %>%
  ggplot(aes(x = value, color = Method)) +
  geom_density() +
  scale_x_continuous(limits = c(0, 5)) +
  labs(x = "Standard Deviation (s)")
Warning message:
Removed 335 rows containing non-finite values (stat_density)
And now alt2 and algos:
sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(alt2_std_dev:algos_std_dev, names_to = "Method") %>%
  ggplot(aes(x = value, color = Method)) +
  geom_density() +
  scale_x_continuous(limits = c(0, 1.7)) +
  labs(x = "Standard Deviation (s)")
As you can see, the RcppAlgos solution gives the most balanced load every time.
And finally, here is an illustration that demonstrates how close each method is to the target solution:
summary(abs(t(sims)[, "algos_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003147 0.057913 0.081986 0.081693 0.106312 0.179099
summary(abs(t(sims)[, "alt2_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01175 0.14321 0.23916 0.30730 0.40949 2.03156
summary(abs(t(sims)[, "alt1_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4979 2.9815 4.4725 4.9660 6.3220 16.5716
summary(abs(t(sims)[, "rando_max"] - t(sims)[, "target_soln"]))
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.16 98.14 143.64 154.10 200.41 427.81
We see that the RcppAlgos solution is around 3-4 times closer on average to the target solution than the second-best method (alt2 in this case).
Update
For the most part, the alt1/alt2 methods perform relatively well and are very simple, which is a huge plus. However, there are many cases where they will fail. For example, given X workers and X - 1 tasks that you know take substantially longer than the other tasks, those methods rely on sorting and will predictably allocate too much work to X - 1 of the workers. Simply change the following line in the function race():
## Original
tasks <- tibble(task = 1:N_TASKS, cost = runif(N_TASKS, min = 1, max = 10)^2)
## Modified
tasks <- tibble(task = 1:N_TASKS, cost = c(runif(X - 1, 15, 25),
                                           runif(N_TASKS - X + 1, min = 1, max = 10))^2)
Now rerun and observe:
set.seed(24332)
sims <- replicate(1e3, race())
sims <- sims %>%
  t() %>%
  as_tibble() %>%
  pivot_longer(rando:algo, names_to = "Method")

ggplot(sims, aes(x = value, color = Method)) +
  geom_density() +
  scale_x_continuous(limits = c(0, max(sims$value))) +
  labs(x = "Total runtime with Large Gap (s)")

sims %>%
  group_by(Method) %>%
  summarize(time_relative_to_ref = mean(value - ref)) %>%
  arrange(time_relative_to_ref)
# A tibble: 4 x 2
Method time_relative_to_ref
<chr> <dbl>
1 algo 0.109
2 alt2 150.
3 alt1 184.
4 rando 839.
Although this is a contrived example, it shows that since the alt1/alt2 solutions make assumptions about the underlying data, they will inevitably fail when presented with a more general problem.
* Disclosure: I am the author of RcppAlgos
Answer 2:
The following heuristic might give you good results:
Sort all the tasks by cost in descending order. For each task, assign it to the worker which has the minimum total assigned cost so far.
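In R, this heuristic could be sketched roughly as follows (this sketch is not part of the original answer; note that, unlike the constraint in the question, it does not force every worker to receive the same number of tasks):

greedy_assign <- function(tasks, X) {
  # longest-processing-time-first: sort by descending cost, then give each
  # task to whichever worker currently has the smallest total assigned cost
  tasks <- tasks[order(tasks$cost, decreasing = TRUE), ]
  load <- numeric(X)
  worker <- integer(nrow(tasks))
  for (i in seq_len(nrow(tasks))) {
    w <- which.min(load)
    worker[i] <- w
    load[w] <- load[w] + tasks$cost[i]
  }
  tasks$worker <- worker
  tasks
}

# e.g. with the 20-task tibble from the question and 4 workers
balanced <- greedy_assign(tasks, X = 4)
sapply(split(balanced$cost, balanced$worker), sum)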
Source: https://stackoverflow.com/questions/58975589/allocating-tasks-to-parallel-workers-so-that-expected-cost-is-roughly-equal