How to share work roughly evenly between processes in MPI despite the array_size not being cleanly divisible by the number of processes?

前端未结

关注

 7  1553

Hi all, I have an array of length N, and I\'d like to divide it as best as possible between \'size\' processors. N/size has a remainder, e.g. 1000 array elements divided b

相关标签:

7条回答

天涯浪人

2021-01-04 10:51

I had a similar problem, and here is my non optimum solution with Python and mpi4py API. An optimum solution would take into account how the processors are laid out, here extra work is ditributed to lower ranks. The uneven workload only differ by one task, so it should not be a big deal in general.

from mpi4py import MPI
import sys
def get_start_end(comm,N):
    """
    Distribute N consecutive things (rows of a matrix , blocks of a 1D array)
    as evenly as possible over a given communicator.
    Uneven workload (differs by 1 at most) is on the initial ranks.

    Parameters
    ----------
    comm: MPI communicator
    N:  int
    Total number of things to be distributed.

    Returns
    ----------
    rstart: index of first local row
    rend: 1 + index of last row

    Notes
    ----------
    Index is zero based.
    """

    P      = comm.size
    rank   = comm.rank
    rstart = 0
    rend   = N
    if P >= N:
        if rank < N:
            rstart = rank
            rend   = rank + 1
        else:
            rstart = 0
            rend   = 0
    else:
        n = N//P # Integer division PEP-238
        remainder = N%P
        rstart    = n * rank
        rend      = n * (rank+1)
        if remainder:
            if rank >= remainder:
                rstart += remainder
                rend   += remainder
            else:
                rstart += rank
                rend   += rank + 1
    return rstart, rend

if __name__ == '__main__':
    comm = MPI.COMM_WORLD
    n = int(sys.argv[1])
    print(comm.rank,get_start_end(comm,n))

0 讨论(0)

执笔经年

2021-01-04 10:57

Here's a closed-form solution.

Let N = array length and P = number of processors.

From j = 0 to P-1,

Starting point of array on processor j = floor(N * j / P)

Length of array on processor j = floor(N * (j + 1) / P) – floor(N * j / P)

0 讨论(0)
发布评论:

提交评论
- 加载中...

后悔当初

2021-01-04 11:00

I think that the best solution is to write yourself a little function for splitting work across processes evenly enough. Here's some pseudo-code, I'm sure you can write C (is that C in your question ?) better than I can.

function split_evenly_enough(num_steps, num_processes)
    return = repmat(0, num_processes)  ! pseudo-Matlab for an array of num_processes 0s
    steps_per_process = ceiling(num_steps/num_processes)
    return = steps_per_process - 1 ! set all elements of the return vector to this number
    return(1:mod(num_steps, num_processes)) = steps_per_process  ! some processes have 1 more step
end

0 讨论(0)

一整个雨季

2021-01-04 11:00
How about this?
```
int* distribute(int total, int processes) {
    int* distribution = new int[processes];
    int last = processes - 1;        

    int remaining = total;
    int process = 0;

    while (remaining != 0) {
        ++distribution[process];
        --remaining;

        if (process != last) {
            ++process;
        }
        else {
            process = 0;
        }
    }

    return distribution;
}
```
The idea is that you assign an element to the first process, then an element to the second process, then an element to the third process, and so on, jumping back to the first process whenever the last one is reached.

This method works even when the number of processes is greater than the number of elements. It uses only very simple operations and should therefore be very fast.
0 讨论(0)
发布评论:

提交评论
- 加载中...
耶瑟儿～

2021-01-04 11:02
Consider your "1000 steps and 7 processes" example.
- simple division won't work because integer division (in C) gives you the floor, and you are left with some remainder: i.e. 1000 / 7 is 142, and there will be 6 doodads hanging out
- ceiling division has the opposite problem: ceil(1000/7) is 143, but then the last processor overruns the array, or ends up with less to do than the others.
You are asking for a scheme to evenly distribute the remainder over processors. Some processes should have 142, others 143. There must be a more formal approach but considering the attention this question's gotten in the last six months maybe not.

Here's my approach. Every process needs to do this algorithm, and just pick out the answer it needs for itself.
```
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char ** argv)
{
#define NR_ITEMS 1000
    int i, rank, nprocs;;
    int *bins;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    bins = calloc(nprocs, sizeof(int));

    int nr_alloced = 0;
    for (i=0; i<nprocs; i++) {
        remainder = NR_ITEMS - nr_alloced;
        buckets = (nprocs - i);
        /* if you want the "big" buckets up front, do ceiling division */
        bins[i] = remainder / buckets;
        nr_alloced += bins[i];
    }

    if (rank == 0)
        for (i=0; i<nprocs; i++) printf("%d ", bins[i]);

    MPI_Finalize();
    return 0;
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
悲&欢浪女

2021-01-04 11:12
I know this is long sense gone but a simple way to do this is to give each process the floor of the (number of items) / (number of processes) + (1 if process_num < num_items mod num_procs). In python, an array with work counts:
```
# Number of items
NI=128
# Number of processes
NP=20

# Items per process
[NI/NP + (1 if P < NI%NP else 0)for P in range(0,NP)]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页