I am currently working on my bachelor thesis and basically my task is to optimise a given code in Go, i.e. make it run as fast as possible. First, I opt
Why?
A single "wrong" SLOC may degrade performance by more than about +37%, or may improve performance so that it spends less than 43% ( i.e. about -57% ) of the baseline processing time:

    51.151µs on MA(200) [10000]float64 ~ 22.017µs on MA(200) [10000]int
    70.325µs on MA(200) [10000]float64
Why []int-s?
You can see it on your own above - this is the bread and butter of HPC/fintech efficient sub-[us] processing strategies ( and we are still speaking in terms of just [SERIAL] process scheduling ).
This one may be tested at any scale - but rather first test ( here ) your own implementation, on the very same scale - the MA(200) [10000]float64 setup - and post your baseline durations in [us], so as to see the initial process performance and to compare apples-to-apples against the posted 51.2 [us] threshold.
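For an apples-to-apples baseline, a minimal sketch of such a serial MA(200) pass over a [10000]float64 input may look as follows ( a single-pass, running-window-sum variant; the function and the synthetic data here are just illustrative, not the original code under test ):

    package main

    import (
        "fmt"
        "time"
    )

    // movingAverage computes MA(window) in one serial pass, maintaining a running
    // window sum instead of re-summing the whole window on every step.
    func movingAverage(data []float64, window int) []float64 {
        out := make([]float64, len(data)-window+1)
        var sum float64
        for i, v := range data {
            sum += v
            if i >= window {
                sum -= data[i-window]
            }
            if i >= window-1 {
                out[i-window+1] = sum / float64(window)
            }
        }
        return out
    }

    func main() {
        data := make([]float64, 10000)
        for i := range data {
            data[i] = float64(i % 37) // any synthetic input will do for timing
        }
        t0 := time.Now()
        ma := movingAverage(data, 200)
        fmt.Println(len(ma), "values in", time.Since(t0))
    }

A proper go test -bench=. micro-benchmark is of course preferable to a single time.Since() measurement, but even this rough timing is enough to get a first [us]-scale figure to post and compare.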
Next comes the harder part:
Yes, one may go and implement a Moving Average calculation so that it indeed proceeds through the heaps of data using some intentionally indoctrinated "just"-[CONCURRENT] processing approach ( irrespective of whether that is due to some kind of error, some authority's "advice", professional blindness or just a dual-Socrates-fair ignorance ). This obviously does not mean that the nature of the convolutional stream-processing, present inside the Moving Average mathematical formulation, stops being a pure [SERIAL] process just because of an attempt to enforce it to be calculated within some degree of "just"-[CONCURRENT] processing.
( Btw. the hard computer scientists and dual-domain nerds will also object here that the Go language, by design, uses Rob Pike's best skills to provide a framework of concurrent coroutines, not any true-[PARALLEL] process-scheduling, even though Hoare's CSP-tools, available in the language concept, may add some salt and pepper and introduce blocking inter-process communication tools, which force "just"-[CONCURRENT] code sections into some hardwired CSP-p2p-synchronisation. )
Having a poor level of performance in [SERIAL] code does not set any yardstick. Only after a reasonable amount of single-thread performance tuning may one benefit from going distributed ( still having to pay additional serial costs, which brings Amdahl's Law ( rather the overhead-strict Amdahl's Law ) into the game ). If one can introduce such a low level of additional setup-overheads and still achieve any remarkable parallelism, scaled over the non-[SEQ] part of the processing, there and only there comes a chance to increase the effective process performance.
It is not hard to lose much more than to gain here, so always benchmark the pure-[SEQ] baseline against the potential tradeoffs of a non-[SEQ] / N[PAR]_processes theoretical, overhead-naive speedup, for which one will pay the cost of a sum of all the add-on-[SEQ]-overheads. It pays off if and only if:

    (  pure-[SEQ]_processing        [ns]
     + add-on-[SEQ]-setup-overheads [ns]
     + non-[SEQ]_processing         [ns] / N[PAR]_processes
    )
    <<
    (  pure-[SEQ]_processing        [ns]
     + non-[SEQ]_processing         [ns] / 1
    )
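Expressed as a tiny numeric check ( the function name and the example figures below are mine, just for illustration, not measured values ), the same condition reads:

    package main

    import "fmt"

    // worthParallelising evaluates the overhead-strict condition above:
    // ( pure-[SEQ] + add-on-[SEQ] setup overheads + non-[SEQ]/N ) must stay well
    // below ( pure-[SEQ] + non-[SEQ] ). All durations are in nanoseconds.
    func worthParallelising(pureSeqNs, setupOverheadNs, nonSeqNs float64, n int) bool {
        parallelised := pureSeqNs + setupOverheadNs + nonSeqNs/float64(n)
        serialOnly := pureSeqNs + nonSeqNs
        return parallelised < serialOnly // "<<" in practice: demand a clear margin, not a tie
    }

    func main() {
        // illustrative figures only: 1 us pure-[SEQ], 20 us setup overheads,
        // 50 us of distributable work, tried on 4 [PAR] processes
        fmt.Println(worthParallelising(1_000, 20_000, 50_000, 4))
    }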
Without this jet-fighter's advantage of both the surplus height and the Sun behind you, never attempt any kind of HPC / parallelisation efforts - they will never pay for themselves unless they are remarkably << better than a smart [SEQ]-process.
One animation is worth a million words. An interactive animation is even better:
So, assume a process-under-test, which has both a [SERIAL] and a [PARALLEL] part in its process schedule. Let p be the [PARALLEL] fraction of the process duration ~ ( 0.0 .. 1.0 ), thus the [SERIAL] part does not last longer than ( 1 - p ), right?
So, let's start the interactive experimentation from such a test-case, where p == 1.0, meaning the whole process duration is spent in just the [PARALLEL] part, while both the initial and the terminating parts of the process-flow ( which are principally always [SERIAL] ) have zero durations ( ( 1 - p ) == 0. ).
Assume the system does no particular magic and thus needs to spend some real steps on the initialisation of each [PARALLEL] part, so as to run it on a different processor ( (1), 2, .., N ), so let's add some overheads for re-organising the process flow and for marshalling + distributing + un-marshalling all the necessary instructions and data, so that the intended process can now start and run on N processors in parallel. These costs are called o ( here initially assumed, for simplicity, to be just constant and invariant to N, which is not always the case in real life, on silicon / on NUMA / on distributed infrastructures ).
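With these symbols, the overhead-strict Amdahl's Law speedup that the interactive graphs plot can be sketched as a tiny helper ( the name and the ( p, o, N ) parametrisation are mine; o is expressed here as a fraction of the baseline duration ):

    package main

    import "fmt"

    // speedup returns the overhead-strict Amdahl's Law speedup for a process with
    // a [PARALLEL] fraction p, a setup overhead o ( both as fractions of the
    // baseline duration ) and N processors serving the [PARALLEL] part.
    func speedup(p, o float64, n int) float64 {
        return 1.0 / ((1.0 - p) + o + p/float64(n))
    }

    func main() {
        fmt.Println(speedup(1.00, 0.0000, 8)) // super-idealistic case: a linear 8.0
        fmt.Println(speedup(1.00, 0.0001, 8)) // a tiny overhead already trims it
        fmt.Println(speedup(0.95, 0.0010, 8)) // a realistic p cuts it much further
    }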
By clicking on the Epilogue headline above, an interactive environment opens and is free for one's own experimentation.
With p == 1. && o == 0. && N > 1 the performance grows steeply towards the currently achievable [PARALLEL]-hardware / O/S limits for a still monolithic O/S code-execution ( where there are still no additional distribution costs for MPI- and similar depeche-mode ( message-passing ) distributions of work-units, for which one would immediately have to add indeed a big number of [ms], while our so far best, just [SERIAL], implementation has obviously done the whole job in less than just ~ 22.1 [us] ).
But except for such an artificially optimistic case, the job does not look so cheap to get efficiently parallelised.
Try having not a zero, but just about ~ 0.01% of setup overhead costs in o, and the line starts to show a very different nature of the overhead-aware scaling, even for the utmost extreme [PARALLEL] case ( still having p == 1.0 ), with the potential speedup landing somewhere just near the half of the initially super-idealistic linear speedup case.
Now, turn p to something closer to reality, somewhere less artificially set than the initial super-idealistic case of == 1.00 --> { 0.99, 0.98, 0.95 } and ... bingo, this is the reality, where process-scheduling ought to be tested and pre-validated.
As an example, if the overhead ( of launching + finally joining a pool of coroutines ) were to take more than ~ 0.1% of the actual [PARALLEL] processing section duration, there would be no speedup bigger than 4x ( about a 1/4 of the original duration in time ) for 5 coroutines ( having p ~ 0.95 ), and not more than 10x ( a 10-times faster duration ) for 20 coroutines ( all assuming that the system has 5 CPU-cores, resp. 20 CPU-cores, free & available and ready ( best with O/S-level CPU-core-affinity mapped processes / threads ) for uninterrupted serving of all those coroutines during their whole life-span, so as to achieve any of the above expected speedups ).
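A worked check of these two figures, using the same overhead-strict expression as above and keeping the overhead term negligible:

    p = 0.95, N =  5:  1 / ( 0.05 + 0.95 /  5 ) = 1 / 0.2400 ~  4.2x
    p = 0.95, N = 20:  1 / ( 0.05 + 0.95 / 20 ) = 1 / 0.0975 ~ 10.3x

so even with all 5, resp. 20, cores free and o kept near zero, the ( 1 - p ) = 0.05 serial fraction alone already caps the achievable speedup near these values.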
Not having such an amount of hardware resources free and ready for all of those task-units intended for implementing the [PARALLEL]-part of the process-schedule, the blocking/waiting states will introduce additional absolute wait-states, the resulting performance adds these new-[SERIAL]-blocking/waiting sections to the overall process-duration, and the initially wished-for speedups suddenly cease to exist, with the performance factor falling well under << 1.00 ( meaning that the effective run-time was, due to the blocking-states, way slower than the non-parallelised just-[SERIAL] workflow ).
This may sound complicated for new keen experimenters, however we may put it into a reversed perspective. Given that the whole process of distributing the intended [PARALLEL] pool-of-tasks is known not to be shorter than, say, about 10 [us], the overhead-strict graphs show that there needs to be at least about 1000 x 10 [us] of non-blocking, computing-intensive processing inside the [PARALLEL] section, so as not to devastate the efficiency of the parallelised-processing.
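The arithmetic behind that rule-of-thumb is just the ~ 0.1% overhead threshold turned around: to keep a 10 [us] distribution overhead at or below ~ 0.1% of the [PARALLEL] section, that section has to last at least 10 [us] / 0.001 = 10 [ms] = 1000 x 10 [us] of useful, non-blocking work.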
If there is not a sufficiently "fat" piece of processing, the overhead-costs ( going remarkably above the above-cited threshold of ~ 0.1% ) then brutally devastate the net-efficiency of the otherwise successfully parallelised processing ( which has been performed at such unjustifiably high relative costs of the setup versus the limited net effects of any N processors, as was demonstrated in the available live-graphs ).
There is no surprise for distributed-computing nerds that the overhead o also comes with additional dependencies - on N ( the more processes, the more effort is to be spent distributing work-packages ), on the marshalled data-BLOBs' sizes ( the larger the BLOBs, the longer the MEM-/IO-devices remain blocked before serving the next process with a distributed BLOB across such a device/resource, for each of the target 2..N-th receiving processes ), and on avoided / CSP-signalled, channel-mediated inter-process coordinations ( call it additional per-incident blocking, reducing p further and further below the ultimately nice ideal of 1. ).
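A minimal extension of the earlier speedup sketch ( again with illustrative names and made-up coefficients, not measured figures ) can model such an N-dependent overhead as a fixed setup fraction plus a per-process distribution cost:

    package main

    import "fmt"

    // speedupScaledOverhead is the same overhead-strict expression as before, but
    // with an overhead that grows with N: a fixed setup part plus a per-process cost.
    func speedupScaledOverhead(p, oFixed, oPerProc float64, n int) float64 {
        o := oFixed + oPerProc*float64(n)
        return 1.0 / ((1.0 - p) + o + p/float64(n))
    }

    func main() {
        for _, n := range []int{2, 4, 8, 16, 32, 64} {
            // illustrative coefficients only: 0.1% fixed + 0.2% per distributed process
            fmt.Printf("N = %2d  speedup ~ %5.2f\n", n, speedupScaledOverhead(0.95, 0.001, 0.002, n))
        }
    }

The speedup then peaks at some moderate N and degrades again, which illustrates why adding ever more processes stops paying off once o stops being a constant.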
So, the real-world reality is very far from the initially idealised, nice and promising p == 1.0, ( 1 - p ) == 0.0 and o == 0.0.
As was obvious from the very beginning, rather try to beat the 22.1 [us] [SERIAL] threshold than try to beat it by going [PARALLEL] and getting worse and worse, where realistic overheads and scaling, on top of already under-performing approaches, do not help a single bit.