Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)

傲寒 2021-01-18 19:12

I'm working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with kno

1 answer
  • 2021-01-18 19:32

    I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.

    Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:

    function c=est_cost_para(cost_blocks,cost_it,num_cores)
    % Estimate cost of parallel computation
    
    % Inputs:
    %   cost_blocks: Estimate of cost per block in arbitrary units. For
    %       consistency with the other code this must be in the reverse order
    %       that the scheduler is fed, i.e. cost should be ascending!
    %   cost_it:     Base cost of iteration (regardless of number of entries)
    %       in the same units as cost_blocks.
    %   num_cores:   Number of cores
    %
    % Output:
    %   c: Estimated cost of parallel computation
    
    num_blocks=numel(cost_blocks);
    c=zeros(num_cores,1);
    
    i=min(num_blocks,num_cores);
    c(1:i)=cost_blocks(end-i+1:end)+cost_it;
    while i<num_blocks
        i=i+1;
        [~,i_min]=min(c); % which core finished first; is fed with next block
        c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
    end
    
    c=max(c);
    
    end
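
    To see what the simulated scheduler does, the same loop can be traced on a toy input. This is only a sketch: the numbers below are made up for illustration, and the loop body is copied inline from est_cost_para so that the snippet runs on its own as a script.

```matlab
% Toy input (made-up numbers), ascending cost as est_cost_para expects
cost_blocks = [1 2 3 4 5];
cost_it     = 0;
num_cores   = 2;

num_blocks = numel(cost_blocks);
c = zeros(num_cores,1);

% The most expensive blocks go to the initially idle cores
i = min(num_blocks,num_cores);
c(1:i) = cost_blocks(end-i+1:end) + cost_it;
fprintf('start: loads = [%s]\n', num2str(c'));

% Every further block goes to the core that becomes free first
while i < num_blocks
    i = i+1;
    [~,i_min] = min(c);
    c(i_min) = c(i_min) + cost_blocks(end-i+1) + cost_it;
    fprintf('block %d (cost %g) -> core %d, loads = [%s]\n', ...
        num_blocks-i+1, cost_blocks(end-i+1), i_min, num2str(c'));
end

c_para = max(c);                        % estimated parallel cost
c_even = sum(cost_blocks)/num_cores;    % perfectly even split, for comparison
```

    For this input the estimate is c_para = 8, against 7.5 for a perfectly even split, so the model already captures some load imbalance even with cost_it = 0.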
    

    The parameter cost_it, the cost of an empty iteration, is a crude blend of several side effects that could conceivably be separated: the cost of an empty iteration in a for/parfor loop (which could also differ per block), the start-up time of the parfor loop, the transmission of its data, and probably more. I lump everything together mainly because I don't want to have to estimate or measure the more granular costs.

    I use the above routine to determine the cut-off in the following way:

    function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
    % Determine cut-off between serial and parallel regime
    
    % Inputs:
    %   cost_blocks: Estimate of cost per block in arbitrary units. For
    %       consistency with the other code this must be in the reverse order
    %       that the scheduler is fed, i.e. cost should be ascending!
    %   cost_it:     Base cost of iteration (regardless of number of entries)
    %       in the same units as cost_blocks.
    %   num_cores:   Number of cores
    %
    % Output:
    %   i: Number of blocks to be calculated serially
    
    num_blocks=numel(cost_blocks);
    cost=zeros(num_blocks+1,2);
    
    for i=0:num_blocks
        cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
        cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
    end
    
    [~,i]=min(sum(cost,2));
    i=i-1;
    
    end
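
    The cut-off search can also be followed on a small example. The sketch below uses made-up numbers and inlines both routines so that it runs as a stand-alone script. The inner loop starts from idle cores and hands out blocks from the most expensive downwards, which should give the same maximum load as est_cost_para (only the core labels differ). The variable names total, busy and n_serial are not from the code above.

```matlab
% Toy input (made-up numbers), ascending cost
cost_blocks = [1 1 1 1 10];
cost_it     = 0.5;
num_cores   = 2;

num_blocks = numel(cost_blocks);
total = zeros(num_blocks+1,1);

for n_ser = 0:num_blocks
    % Serial regime: the n_ser most expensive blocks (same formula as above)
    c_ser = sum(cost_blocks(end-n_ser+1:end))/num_cores + n_ser*cost_it;

    % Parallel regime: greedy scheduler simulation on the remaining blocks
    rest = cost_blocks(1:end-n_ser);
    busy = zeros(num_cores,1);
    for k = numel(rest):-1:1             % most expensive block first
        [~,i_min] = min(busy);           % core that becomes free first
        busy(i_min) = busy(i_min) + rest(k) + cost_it;
    end

    total(n_ser+1) = c_ser + max(busy);
end

[~,idx]  = min(total);
n_serial = idx-1;                        % number of blocks handled serially
```

    Here total comes out as [10.5 8.5 9.5 9 10 9.5], so the minimum is reached at n_serial = 1: only the single expensive block is taken out of the parallel loop.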
    

    In particular, I don't inflate or otherwise change the value of est_cost_para, which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. to avoid feeding blocks that are too large to the parallel loop), one could of course add some percentage as a buffer or even raise the parallel cost to a power > 1.
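
    The two conservative variants could look as follows. This is a minimal sketch: the values of buffer and p are arbitrary assumptions, not tuned recommendations, and c_para stands in for a value returned by est_cost_para.

```matlab
c_para = 8;       % hypothetical raw estimate, e.g. from est_cost_para
buffer = 0.2;     % assumed safety margin of 20%
p      = 1.1;     % assumed exponent > 1

c_buffered = (1+buffer)*c_para;   % percentage buffer
c_powered  = c_para^p;            % power-law inflation

% Caveat: a power > 1 only inflates values above 1, so the effect
% depends on the (arbitrary) units of the cost estimate.
c_small = 0.5^p;                  % smaller than 0.5
```

    The percentage buffer is independent of the units chosen for the cost, whereas the power variant penalises large parallel workloads more strongly and requires the cost to be scaled so that relevant values exceed 1.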

    Note also that est_cost_para is called with successively fewer blocks (although I use the variable name cost_blocks in both routines, one is a subset of the other).

    Compared to the approach in my wordy question I see two main advantages:

    1. The relatively intricate dependence between the data (both the number of blocks and their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
    2. By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).

    Of course, calling est_cost_para with its while-loop for every candidate cut-off raises the asymptotic complexity, but in my case (num_blocks<500) this is absolutely negligible.

    Finally, if a decent value of cost_it does not readily present itself, one can try to calibrate it: measure the actual execution time of each block, as well as that of the purely parallel part, fit the measurements to the cost prediction, and use the result as an updated value of cost_it for the next call of the routine (either via the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.
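
    One way to do the "cost of zero" variant is a straight-line least-squares fit of measured time against estimated cost. The sketch below assumes a linear relation between the two and uses synthetic numbers in place of real timings. The names cost_est, t_meas and coef are made up for this illustration.

```matlab
% A-priori cost per block (arbitrary units) and synthetic "measured" times
% in seconds; real timings would come from e.g. tic/toc around each block.
cost_est = [1; 2; 4; 8; 16];
t_meas   = 0.5*cost_est + 0.2;

% Least-squares fit of t_meas ~ coef(1)*cost_est + coef(2)
A    = [cost_est, ones(numel(cost_est),1)];
coef = A \ t_meas;

% coef(2) is the predicted time at zero cost. Dividing by coef(1)
% (seconds per cost unit) converts it to the units of cost_blocks.
cost_it = coef(2)/coef(1);
```

    With these synthetic values the fit recovers cost_it = 0.4 in cost units. With real, noisy timings the intercept can be poorly determined or even negative, so some smoothing between successive calls is probably advisable.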
