What is the difference between Work, Span and Time in parallel algorithm analysis?


Origin:

PRAMs were introduced in the early 1970s, in the hope that they might deliver a jump ahead in performance for tackling computationally hard problems. Yet the promises, or rather the expectations, were cooled down by the principal limitations these computing-device architectures still have to live with.


Theory:

T1 = the amount of time the processing takes, measured once the execution is done in a pure-[SERIAL] schedule ( this is the Work ).

T∞ = the amount of time it takes to process the computing-graph ( a Directed, just hopefully - which is often forgotten - finally Acyclic, Graph ), once the execution is done in a "just"-[CONCURRENT] manner, but with an indeed infinite amount of real resources available all that time, thus allowing any degree of parallelism to actually, but just incidentally, take place ( this is the Span ).

( WARNING NOTICE: your professors need not enjoy this interpretation, but reality rules -- infinite processors alone are simply not enough, as every other resource must also be present in infinite amounts and capabilities, be it RAM-accesses, IO-s, sensorics et al, so that all of these can provide infinitely parallel services and answer "immediately", avoiding any kind of blocking / waiting / re-scheduling that would otherwise appear whenever any resource is temporally / contextually unable to serve, as asked, an infinitely parallel amount of such service-requests. )
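To keep the three quantities apart, here is a minimal Python sketch relating T1 ( Work ), T∞ ( Span ) and TP ( the Time actually achievable on P processors ); the bounds used are the standard work-law / span-law bounds and the greedy-scheduling bound of Brent's Theorem, which the answer returns to further below, not anything specific to the posted problem:

```python
def tp_bounds(T1, Tinf, P):
    """Classic bounds on T_P for a greedy schedule on P processors.

    T1   -- Work: total number of operations, i.e. time on one processor
    Tinf -- Span: length of the critical path, i.e. time on infinitely
            many processors (ignoring all scheduling / resource costs)
    P    -- number of processors actually available
    """
    lower = max(T1 / P, Tinf)      # work law and span law
    upper = T1 / P + Tinf          # Brent's Theorem (greedy scheduling)
    return lower, upper

# Example: T1 = 1000 ops, Tinf = 50 ops, P = 8 processors
print(tp_bounds(1000, 50, 8))      # -> (125.0, 175.0)
```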


How to tackle:

T1 for the above posted problem consists of two imperatively ordered O(N) blocks - the memory allocation for M[:] and the final search for Max over M[:] - and two O(N²) blocks, processing "all pairs" (i,j) over a domain of N-by-N values.

Based on an assumption of CIS/RIS homogeneity, this Work will be no less than ~ 2N( 1 + N ).
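Since the original question's code is not reproduced here, the following is only a hypothetical Python shape of such a problem, just to make the op-counting behind the ~ 2N( 1 + N ) figure explicit ( the values themselves do not matter, only the structure of the blocks does ):

```python
def serial_T1_shape(N):
    """Count the operations of a hypothetical pure-[SERIAL] run:
    two O(N) blocks plus two O(N^2) all-pairs blocks."""
    work = 0
    M = [0] * N                    # O(N): allocate / initialise M[:]
    work += N
    for i in range(N):             # first O(N^2) "all pairs" block
        for j in range(N):
            work += 1
    for i in range(N):             # second O(N^2) "all pairs" block
        for j in range(N):
            work += 1
    best = max(M)                  # O(N): final search for Max over M[:]
    work += N
    return work                    # == 2*N + 2*N*N == 2*N*(1 + N)

print(serial_T1_shape(4))          # -> 40 == 2*4*(1+4)
```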

For T∞ there is more to do. First, detect which potential parallel code-execution paths may happen; next, also protect the results from being "overwritten" in colliding moments - your headline only briefly mentions CRCW, a rather weak assumption under which to analyse the latter problem on a Concurrent-Read-Concurrent-Write PRAM-machine.

Do not hesitate to take pencil and paper and draw the D(jh)AG for the smallest possible N == 2 ( or 3, if you have a bit larger sheet of paper ), from which one can derive the flow of operations ( and potentially the (in)-dependency ordering of operations in case of the less forgiving but more realistic CREW or EREW types of PRAM-computing-devices ).
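For orientation only, a hedged sketch of how such a Span count might come out under the usual textbook idealisations ( every pair (i,j) gets its own processor and takes one step, and the final Max is either a log-depth tree-reduction on CREW/EREW or a single idealised combining step on a CRCW-max machine ); the real D(jh)AG of the posted problem may of course look different:

```python
import math

def span_estimate(N, machine="CREW"):
    """Hypothetical Span (T-infinity) of the same shape under ideal PRAM
    assumptions: all N*N pairs in one parallel step, then a Max reduction."""
    pairs_depth = 1                                   # all N*N pairs at once
    if machine == "CRCW":
        max_depth = 1                                 # idealised concurrent-write Max
    else:
        max_depth = math.ceil(math.log2(N)) if N > 1 else 1
    return pairs_depth + max_depth

print(span_estimate(2), span_estimate(1024))          # -> 2 11
```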


Criticism: an indeed Devilish part of the Lesson, the one your professors will like the least

Any careful, kind reader has already noted several nontrivial assumptions, the homogeneity of CIS/RIS instruction durations being just one minor case of these.

The biggest, yet hidden, part of the problem is the actual cost of process-scheduling. A pure-[SERIAL] code execution enjoys the ( unfair ) advantage of having zero add-on overhead costs ( plus, on many silicon architectures, there are additional performance tricks coming from out-of-order, re-ordered instruction execution; see superscalar, pipelined or VLIW architectures for in-depth details ), while any sort of real-world process-scheduling principally adds overhead costs that were simply not present in the pure-[SERIAL] code-execution case used for getting the T1.

On real-world systems, where both NUMA effects and non-homogeneous CIS/RIS instruction durations cause remarkable irregularities in code-execution flow durations, these add-on overhead costs indeed dramatically shift the baseline for any speedup comparison.
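To see such add-on costs in practice, here is a small, hypothetical Python experiment ( not from the original question ): the per-item workload is deliberately trivial, so process creation, IPC and scheduling overheads dominate, and the "parallel" run is frequently slower than the pure-[SERIAL] one:

```python
import time
from multiprocessing import Pool

def tiny_task(x):
    return x * x                                       # a deliberately trivial work item

if __name__ == "__main__":
    data = list(range(100_000))

    t0 = time.perf_counter()
    serial = [tiny_task(x) for x in data]              # pure-[SERIAL]: zero add-on costs
    t1 = time.perf_counter()

    with Pool(4) as pool:                              # process creation, IPC, scheduling ...
        parallel = pool.map(tiny_task, data)           # ... are all add-on overhead costs
    t2 = time.perf_counter()

    print(f"serial   : {t1 - t0:.4f} s")
    print(f"4 workers: {t2 - t1:.4f} s  (often slower here, the overheads dominate tiny tasks)")
```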

The Devil's Question:

Where do we account for these real-world add-on costs?

In real life we do.

In the original formulation of Amdahl's Law and in Brent's Theorem, we did not.

The re-formulated Amdahl's Law also takes these { initial | coordination | terminal }-add-on overhead costs as inputs, and suddenly the computed and experimentally validated speed-ups start to match the observations experienced on commonly operated real-world computing fabrics.
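As a minimal sketch of that idea ( the exact re-formulation referred to above may split or name the overhead terms differently ), a speedup model that charges the setup, coordination and termination overheads on top of the classic serial/parallel split could look like this, with all times normalised so that T1 == 1.0:

```python
def speedup_overhead_strict(p, N, o_setup=0.0, o_coord=0.0, o_term=0.0):
    """Hedged sketch of an overhead-aware re-formulation of Amdahl's Law.

    p     -- parallelisable fraction of the original run-time (T1 == 1.0)
    N     -- number of processors
    o_*   -- add-on overhead costs (setup, coordination, termination),
             expressed in the same T1 == 1.0 units
    """
    T_parallel = (1.0 - p) + p / N + o_setup + o_coord + o_term
    return 1.0 / T_parallel

# Classic Amdahl (zero overheads) vs. the overhead-strict figure:
print(speedup_overhead_strict(0.95, 16))                    # ~ 9.1x
print(speedup_overhead_strict(0.95, 16, 0.02, 0.03, 0.01))  # ~ 5.9x
```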
