Which Java synchronization construct is likely to provide the best performance for a concurrent, iterative processing scenario with a fixed number of threads like the one outli
Update: V6 - Busy Wait, with main thread also working
An obvious improvement on V5 (busy wait for work in the 7 worker threads, busy wait for completion in the main thread) was to split the work into 7+1 parts again and let the main thread process one part concurrently with the worker threads instead of just busy-waiting. Once its own part is done, the main thread busy-waits for the other threads' work items to complete. That puts the 8th processor (in the example's 8-core configuration) to work and adds its cycles to the available compute pool.
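For reference, the scheme can be sketched roughly as follows. This is a minimal illustration, not the code that was benchmarked: the class name `BusyWaitCycles`, the `process` work item (it just increments one slice of an array) and the two counters are assumptions made for the example, and `Thread.onSpinWait()` assumes Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class BusyWaitCycles {

    // Hypothetical work item: increment every element of one slice.
    static void process(long[] data, int from, int to) {
        for (int i = from; i < to; i++) {
            data[i]++;
        }
    }

    // Runs `cycles` iterations over (workers + 1) slices of `blockSize` elements each.
    static long[] run(int workers, int cycles, int blockSize) {
        final long[] data = new long[(workers + 1) * blockSize];
        final AtomicInteger cycle = new AtomicInteger(0); // current cycle, written by main only
        final AtomicLong done = new AtomicLong(0);        // finished work items, summed over all cycles

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int from = w * blockSize;
            threads[w] = new Thread(() -> {
                for (int c = 1; c <= cycles; c++) {
                    while (cycle.get() < c) {
                        Thread.onSpinWait(); // busy-wait for work
                    }
                    process(data, from, from + blockSize);
                    done.incrementAndGet(); // signal completion of this work item
                }
            });
            threads[w].start();
        }

        final int mainFrom = workers * blockSize;
        for (int c = 1; c <= cycles; c++) {
            cycle.set(c);                                  // release the workers
            process(data, mainFrom, mainFrom + blockSize); // main works on its own part
            while (done.get() < (long) c * workers) {
                Thread.onSpinWait();                       // busy-wait for completion
            }
        }

        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return data;
    }

    public static void main(String[] args) {
        // One worker per remaining core, so that main plus workers fill the machine.
        int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        long[] data = run(workers, 200, 1024);
        System.out.println("workers=" + workers + ", data[0]=" + data[0]);
    }
}
```

The completion counter is cumulative rather than reset each cycle, which avoids a separate reset step between cycles. The atomic reads and writes also provide the happens-before edges that make each cycle's results visible to the main thread.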
This was indeed straightforward to implement, and the results are again slightly better:
blocksize | system | user | cycles/sec
256k      | 1.0%   | 98%  | 1.39
64k       | 1.0%   | 98%  | 6.8
16k       | 1.0%   | 98%  | 50.4
4096      | 1.0%   | 98%  | 372
1024      | 1.0%   | 98%  | 1317
256       | 1.0%   | 98%  | 3546
64        | 1.5%   | 98%  | 9091
16        | 2.0%   | 98%  | 16949
So this seems to represent the best solution so far.