What is the difference between Work, Span and Time in parallel algorithm analysis?


Origin:

PRAMs were introduced in the early 1970s, in the hope that they might deliver a jump ahead in performance for tackling computationally hard problems. Yet the promises, or rather the expectations, were cooled down by the principal limitations these computing-device architectures still have to live with.


Theory:

T1 = the amount of time the processing takes, measured once the execution is done in a pure-[SERIAL] schedule ( this is the Work ).

T∞ = the amount of time it takes to process the computing-graph ( a Directed, just hopefully - which is often forgotten - finally Acyclic, Graph ), once the execution is done in a "just"-[CONCURRENT] manner, but with an indeed infinite amount of real resources available all that time, thus allowing any degree of parallelism to actually, but just incidentally, take place ( this is the Span ).

( WARNING NOTICE: your professors need not enjoy this interpretation, but reality rules -- infinite processors alone are simply not enough, as every other resource must also be present in infinite amounts and capabilities, be it RAM-accesses, IO-s, sensorics et al, so that all of these can provide infinitely parallel services and answer "immediately", avoiding any kind of blocking / waiting / re-scheduling that would otherwise appear whenever any resource is temporally / contextually unable to serve, as asked, an infinitely parallel amount of such service-requests. )
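To keep the three quantities apart, here is a minimal Python sketch relating T1 ( Work ), T∞ ( Span ) and TP ( the Time actually achievable on P processors ); the bounds used are the standard work-law / span-law bounds and the greedy-scheduling bound of Brent's Theorem, which the answer returns to further below, not anything specific to the posted problem:

```python
def tp_bounds(T1, Tinf, P):
    """Classic bounds on T_P for a greedy schedule on P processors.

    T1   -- Work: total number of operations, i.e. time on one processor
    Tinf -- Span: length of the critical path, i.e. time on infinitely
            many processors (ignoring all scheduling / resource costs)
    P    -- number of processors actually available
    """
    lower = max(T1 / P, Tinf)      # work law and span law
    upper = T1 / P + Tinf          # Brent's Theorem (greedy scheduling)
    return lower, upper

# Example: T1 = 1000 ops, Tinf = 50 ops, P = 8 processors
print(tp_bounds(1000, 50, 8))      # -> (125.0, 175.0)
```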


How to tackle:

T1 for the above posted problem consists of two imperatively ordered O(N) blocks - the memory allocation for M[:] and the final search for Max over M[:] - and two O(N²) blocks, processing "all pairs" (i,j) over a domain of N-by-N values.

Based on an assumption of CIS/RIS homogeneity, this Work will be no less than ~ 2N( 1 + N ).
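Since the original question's code is not reproduced here, the following is only a hypothetical Python shape of such a problem, just to make the op-counting behind the ~ 2N( 1 + N ) figure explicit ( the values themselves do not matter, only the structure of the blocks does ):

```python
def serial_T1_shape(N):
    """Count the operations of a hypothetical pure-[SERIAL] run:
    two O(N) blocks plus two O(N^2) all-pairs blocks."""
    work = 0
    M = [0] * N                    # O(N): allocate / initialise M[:]
    work += N
    for i in range(N):             # first O(N^2) "all pairs" block
        for j in range(N):
            work += 1
    for i in range(N):             # second O(N^2) "all pairs" block
        for j in range(N):
            work += 1
    best = max(M)                  # O(N): final search for Max over M[:]
    work += N
    return work                    # == 2*N + 2*N*N == 2*N*(1 + N)

print(serial_T1_shape(4))          # -> 40 == 2*4*(1+4)
```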

For T∞ there is more to do. First, detect which potential parallel code-execution paths may happen; next, also protect the results from being "overwritten" in colliding moments - your headline only briefly mentions CRCW, a rather weak assumption under which to analyse the latter problem on a Concurrent-Read-Concurrent-Write PRAM-machine.

Do not hesitate to take pencil and paper and draw the D(jh)AG for the smallest possible N == 2 ( or 3, if you have a bit larger sheet of paper ), from which one can derive the flow of operations ( and potentially the (in)-dependency ordering of operations in case of the less forgiving but more realistic CREW or EREW types of PRAM-computing-devices ).
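For orientation only, a hedged sketch of how such a Span count might come out under the usual textbook idealisations ( every pair (i,j) gets its own processor and takes one step, and the final Max is either a log-depth tree-reduction on CREW/EREW or a single idealised combining step on a CRCW-max machine ); the real D(jh)AG of the posted problem may of course look different:

```python
import math

def span_estimate(N, machine="CREW"):
    """Hypothetical Span (T-infinity) of the same shape under ideal PRAM
    assumptions: all N*N pairs in one parallel step, then a Max reduction."""
    pairs_depth = 1                                   # all N*N pairs at once
    if machine == "CRCW":
        max_depth = 1                                 # idealised concurrent-write Max
    else:
        max_depth = math.ceil(math.log2(N)) if N > 1 else 1
    return pairs_depth + max_depth

print(span_estimate(2), span_estimate(1024))          # -> 2 11
```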


Criticism: an indeed Devilish part of the Lesson, the one your professors will like the least

Any careful, kind reader has already noted several nontrivial assumptions, the homogeneity of CIS/RIS instruction durations being just one minor case of these.

The biggest, yet hidden, part of the problem is the actual cost of process-scheduling. A pure-[SERIAL] code execution enjoys the ( unfair ) advantage of having zero add-on overhead costs ( plus, on many silicon architectures, there are additional performance tricks coming from out-of-order, re-ordered instruction execution; see superscalar, pipelined or VLIW architectures for in-depth details ), while any sort of real-world process-scheduling principally adds overhead costs that were simply not present in the pure-[SERIAL] code-execution case used for getting the T1.

On real-world systems, where both NUMA effects and non-homogeneous CIS/RIS instruction durations cause remarkable irregularities in code-execution flow durations, these add-on overhead costs indeed dramatically shift the baseline for any speedup comparison.
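To see such add-on costs in practice, here is a small, hypothetical Python experiment ( not from the original question ): the per-item workload is deliberately trivial, so process creation, IPC and scheduling overheads dominate, and the "parallel" run is frequently slower than the pure-[SERIAL] one:

```python
import time
from multiprocessing import Pool

def tiny_task(x):
    return x * x                                       # a deliberately trivial work item

if __name__ == "__main__":
    data = list(range(100_000))

    t0 = time.perf_counter()
    serial = [tiny_task(x) for x in data]              # pure-[SERIAL]: zero add-on costs
    t1 = time.perf_counter()

    with Pool(4) as pool:                              # process creation, IPC, scheduling ...
        parallel = pool.map(tiny_task, data)           # ... are all add-on overhead costs
    t2 = time.perf_counter()

    print(f"serial   : {t1 - t0:.4f} s")
    print(f"4 workers: {t2 - t1:.4f} s  (often slower here, the overheads dominate tiny tasks)")
```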

The Devil's Question:

Where do we account for these real-world add-on costs?

In real life we do.

In the original formulation of Amdahl's Law and in Brent's Theorem, we did not.

The re-formulated Amdahl's Law also takes these { initial | coordination | terminal }-add-on overhead costs as inputs, and suddenly the computed and experimentally validated speed-ups start to match the observations experienced on commonly operated real-world computing fabrics.
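As a minimal sketch of that idea ( the exact re-formulation referred to above may split or name the overhead terms differently ), a speedup model that charges the setup, coordination and termination overheads on top of the classic serial/parallel split could look like this, with all times normalised so that T1 == 1.0:

```python
def speedup_overhead_strict(p, N, o_setup=0.0, o_coord=0.0, o_term=0.0):
    """Hedged sketch of an overhead-aware re-formulation of Amdahl's Law.

    p     -- parallelisable fraction of the original run-time (T1 == 1.0)
    N     -- number of processors
    o_*   -- add-on overhead costs (setup, coordination, termination),
             expressed in the same T1 == 1.0 units
    """
    T_parallel = (1.0 - p) + p / N + o_setup + o_coord + o_term
    return 1.0 / T_parallel

# Classic Amdahl (zero overheads) vs. the overhead-strict figure:
print(speedup_overhead_strict(0.95, 16))                    # ~ 9.1x
print(speedup_overhead_strict(0.95, 16, 0.02, 0.03, 0.01))  # ~ 5.9x
```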
