When analysing parallel algorithms, we tend to focus on Work (T1), Span (T∞), or Time.
What I'm confused about is: if I were given an algorithm to analyse, what key hints would I need to look for to determine its Work, Span, and Time?
How do I analyse the above algorithm to find its Work, Span, and Time?
Origin:
PRAMs were introduced in the early 1970s, in the hope that they might allow a leap ahead in performance for tackling computationally hard problems. Yet the promises, or rather the expectations, were cooled down by the principal limitations these computing-device architectures still have to live with.
Theory:
T1 = the amount of time the processing takes, measured once execution is done in a pure-[SERIAL]
schedule.
T∞ = the amount of time the processing of a computing-graph takes ( a Directed and, as is often forgotten, Acyclic Graph ), once execution is done in a "just"-[CONCURRENT]
manner, but with an indeed infinite amount of real resources available all that time, thus allowing any degree of parallelism to actually, yet just incidentally, take place.
( WARNING NOTICE: your professors need not enjoy this interpretation, but reality rules -- infinite processors alone are simply not enough, as every other resource must also be present in infinite amounts and capabilities, be it RAM accesses, I/O-s, sensorics et al. All of these must provide infinitely parallel services, avoiding any kind of blocking / waiting / re-scheduling that might appear due to any resource's temporal or contextual inability to serve as asked, under an infinitely parallel amount of such service requests, and must answer "immediately" ).
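The two quantities above can be cross-checked mechanically. A minimal sketch ( my own illustration, not from the original post ), assuming every node of the computing-graph costs one unit of time: Work is the total node count, Span is the length of the longest dependency chain.

```python
# A minimal sketch, assuming unit-cost nodes in a toy computing-graph (DAG).
# Each node maps to the list of nodes it depends on.
dag = {
    "load_a": [],
    "load_b": [],
    "mul":    ["load_a", "load_b"],   # needs both loads finished
    "add":    ["mul"],
    "store":  ["add"],
}

def work(dag):
    """T1: total number of unit-cost operations (pure-[SERIAL] time)."""
    return len(dag)

def span(dag):
    """T-infinity: length of the critical path through the DAG."""
    memo = {}
    def depth(node):
        if node not in memo:
            deps = dag[node]
            memo[node] = 1 + (max(map(depth, deps)) if deps else 0)
        return memo[node]
    return max(depth(n) for n in dag)

print(work(dag))  # 5  -> T1
print(span(dag))  # 4  -> load -> mul -> add -> store
```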
How to tackle:
T1 for the above posted problem has two imperatively ordered O(N) blocks -- the memory allocation for M[:] and the final search for Max over M[:] -- and two O(N²) blocks, processing "all pairs" (i,j) over a domain of N-by-N values.
Based on an assumption of CIS/RIS homogeneity, this Work will be no less than ~ 2N(1+N).
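That accounting can be sketched directly ( a hedged illustration of my own, assuming unit-cost, CIS/RIS-homogeneous instructions and exactly the four blocks listed above ):

```python
# A hedged sketch of the T1 accounting: two O(N) blocks plus two O(N^2)
# blocks, each instruction assumed to cost one unit (CIS/RIS homogeneity).
def t1_estimate(N):
    alloc    = N       # allocate / initialise M[:]
    pairs_1  = N * N   # first "all pairs" (i, j) block
    pairs_2  = N * N   # second "all pairs" (i, j) block
    max_scan = N       # final search for Max over M[:]
    return alloc + pairs_1 + pairs_2 + max_scan

# matches the 2N(1+N) claim above:
for N in (2, 3, 10):
    assert t1_estimate(N) == 2 * N * (1 + N)
print(t1_estimate(10))  # 220
```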
For T∞ there is more to do. First, detect what potentially parallel code-execution paths may occur; next, also protect the results from being "overwritten" at colliding moments -- your headline only briefly mentions CRCW, a weak assumption under which to analyse the latter problem for a Concurrent-Read-Concurrent-Write PRAM machine.
Do not hesitate to take a pencil and paper and draw the D(jh)AG for the smallest possible N == 2 ( or 3, if you have a bit larger paper ), from which one can derive the flow of operations ( and the potential (in)-dependency ordering of operations, in case of the less forgiving but more realistic CREW or EREW types of PRAM computing devices ).
Criticism: the indeed Devil's part of the Lesson, which your professors will like the least
Any careful, kind reader has already noted several nontrivial assumptions, the homogeneity of CIS/RIS instruction durations being one minor case of these.
The biggest, yet hidden, part of the problem is the actual cost of process scheduling. A pure-[SERIAL]
code execution enjoys the ( unfair ) advantage of having zero add-on overhead costs ( plus, on many silicon architectures, there are additional performance tricks arriving from out-of-order, re-ordered instruction execution; ref. superscalar, pipelined or VLIW architectures for in-depth details ), while any sort of real-world process scheduling principally adds overhead costs that were not present in the pure-[SERIAL]
code-execution case used for getting T1.
On real-world systems, where both NUMA effects and non-homogeneous CIS/RIS instruction durations cause remarkable irregularities in code-execution flow durations, these add-on overhead costs indeed dramatically shift the baseline for any speedup comparison.
The Devil's Question:
Where do we account for these real-world add-on costs?
In real life we do.
In the original Amdahl's Law formulation and in Brent's Theorem, we did not.
The re-formulated Amdahl's Law also inputs these { initial | coordination | terminal }
-add-on overhead costs, and suddenly the computed and experimentally validated speedups start to match the observations experienced on commonly operated real-world computing fabrics.
Source: https://stackoverflow.com/questions/47946676/what-is-the-difference-between-work-span-and-time-in-parallel-algorithm-analysi