What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

Posted by 五迷三道 on 2019-12-02 21:59:28

It seems that you do not need any synchronization between the workers. Maybe you should consider the Fork/Join framework, which is part of Java 7 and is also available as a separate library.
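As a rough illustration of what the Fork/Join suggestion could look like, here is a minimal sketch that sums an array by recursively splitting it into blocks. The class name BlockSumTask, the THRESHOLD value and the summing workload are all assumptions made for the example; they are not taken from the original question.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical example task: sums a range of an array, splitting large ranges in two. */
final class BlockSumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1024; // assumed block size; tune for the real workload
    private final long[] data;
    private final int from; // inclusive
    private final int to;   // exclusive

    BlockSumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: process the block directly in this thread.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        BlockSumTask left = new BlockSumTask(data, from, mid);
        BlockSumTask right = new BlockSumTask(data, mid, to);
        left.fork();                      // hand the left half to the pool
        long rightSum = right.compute();  // process the right half in this thread
        return left.join() + rightSum;    // wait for the left half and combine
    }

    /** Convenience entry point: creates a pool, runs the task, shuts the pool down. */
    static long sum(long[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new BlockSumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

For a cyclic workload the pool would presumably be created once and reused for every cycle rather than per call as in this sketch.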

Update: V6 - Busy Wait, with main thread also working

An obvious improvement on V5 (busy wait for work in 7 worker threads, busy wait for completion in the main thread) was to split the work into 7+1 parts again and let the main thread process one part concurrently with the worker threads instead of just busy-waiting. The main thread then busy-waits for the other threads to complete their work items. This utilizes the 8th processor (in the example's 8-core configuration) and adds its cycles to the available compute pool.

This was indeed straightforward to implement, and the results are again slightly better:

blocksize | system | user | cycles/sec
256k        1.0%     98%       1.39
64k         1.0%     98%       6.8
16k         1.0%     98%      50.4
4096        1.0%     98%     372
1024        1.0%     98%    1317
256         1.0%     98%    3546
64          1.5%     98%    9091
16          2.0%     98%   16949

So this seems to be the best solution so far.
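The V6 idea (helpers plus the main thread each taking one part of the work) might be sketched roughly as follows. This is not the original code: the class MainAlsoWorks, the cycle counter used to hand out work, and the array-summing workload are assumptions chosen to keep the example small.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: helperCount spinning helpers plus the calling thread share each cycle's work. */
final class MainAlsoWorks {

    /** Sums data once per cycle, split into helperCount + 1 parts; returns the last cycle's sum. */
    static long sumRepeatedly(final long[] data, int helperCount, int cycles)
            throws InterruptedException {
        final int parts = helperCount + 1;
        final long[] partial = new long[parts];
        final AtomicInteger cycle = new AtomicInteger(0); // published by main; -1 means stop
        final AtomicInteger done = new AtomicInteger(0);  // helpers finished in the current cycle

        Thread[] helpers = new Thread[helperCount];
        for (int h = 0; h < helperCount; h++) {
            final int part = h;
            helpers[h] = new Thread(() -> {
                int seen = 0;
                for (;;) {
                    int c = cycle.get();
                    if (c < 0) return;        // shutdown signal
                    if (c == seen) continue;  // busy-wait for the next cycle
                    seen = c;
                    partial[part] = sumPart(data, part, parts);
                    done.incrementAndGet();   // publishes the partial result to main
                }
            });
            helpers[h].start();
        }

        long last = 0;
        for (int c = 1; c <= cycles; c++) {
            done.set(0);
            cycle.set(c); // release the helpers
            // The main thread processes the final part instead of idling.
            partial[helperCount] = sumPart(data, helperCount, parts);
            while (done.get() < helperCount) {
                // busy-wait for the helpers to complete their parts
            }
            last = 0;
            for (long p : partial) last += p;
        }
        cycle.set(-1);
        for (Thread t : helpers) t.join();
        return last;
    }

    /** Sums the part-th of parts roughly equal slices of data. */
    static long sumPart(long[] data, int part, int parts) {
        int from = (int) ((long) data.length * part / parts);
        int to = (int) ((long) data.length * (part + 1) / parts);
        long s = 0;
        for (int i = from; i < to; i++) s += data[i];
        return s;
    }
}
```

On the 8-core machine described above, the corresponding call would use helperCount = 7. As with any pure busy wait, this only makes sense when every thread has a core to itself.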

Update: V5 - Busy Wait in all threads (seems optimal so far)

Since all cores are dedicated to this task, it seemed worth a try to simply eliminate all the complex synchronization constructs and do a busy wait at each synchronization point in all threads. This turns out to beat all other approaches by a wide margin.

The setup is as follows: start with V4 above (CyclicBarrier + Busy Wait). Replace the CyclicBarrier with an AtomicInteger that the main thread resets to zero each cycle. Each worker thread Runnable that completes its work increments the atomic integer by one. The main thread busy waits:

while( true ) {
    // busy-wait for threads to complete their work
    if( atomicInt.get() >= workerThreadCount ) break;
}

Instead of 8, only 7 worker threads are launched (since all threads, including the main thread, now load a core pretty much completely). The results are as follows:

blocksize | system | user | cycles/sec
256k        1.0%     98%       1.36
64k         1.0%     98%       6.8
16k         1.0%     98%      44.6
4096        1.0%     98%     354
1024        1.0%     98%    1189
256         1.0%     98%    3222
64          1.5%     98%    8333
16          2.0%     98%   16129

Using wait/notify in the worker threads reduces the throughput to about one third of what this solution achieves.
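For readers who want to see the whole V5 cycle in one place, the following is a minimal sketch under stated assumptions. The completion counter follows the description above; the way work is handed to the workers (a shared cycle number that the workers spin on) and the names BusyWaitCycles, cycle and done are assumptions, since the original worker code is not shown.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a fully busy-waiting work cycle. */
final class BusyWaitCycles {

    /** Runs the given number of cycles; returns the total number of work items processed. */
    static long run(int workerCount, int cycles) throws InterruptedException {
        final AtomicInteger cycle = new AtomicInteger(0); // published by main; -1 means stop
        final AtomicInteger done = new AtomicInteger(0);  // reset to zero by main each cycle
        final long[] processed = new long[workerCount];   // per-worker counters, read after join()

        Thread[] workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                int seen = 0;
                while (true) {
                    int c = cycle.get();
                    if (c == -1) return;      // shutdown signal
                    if (c == seen) continue;  // busy-wait for work
                    seen = c;
                    processed[id]++;          // placeholder for the real work item
                    done.incrementAndGet();   // report completion to the main thread
                }
            });
            workers[w].start();
        }

        for (int c = 1; c <= cycles; c++) {
            done.set(0);  // reset before releasing the workers
            cycle.set(c); // release the workers for this cycle
            while (done.get() < workerCount) {
                // busy-wait for threads to complete their work
            }
        }

        cycle.set(-1);
        long total = 0;
        for (Thread t : workers) t.join();
        for (long p : processed) total += p;
        return total;
    }
}
```

The reset is safe here because a worker only increments the counter after it has observed the new cycle number, and the main thread only starts the next cycle after all workers have reported in.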

I also wonder if you could try more than 8 threads. If your CPU supports Hyper-Threading then (at least in theory) you can run 2 threads per core and see what comes out of it.

Update: V7 - Busy Wait that reverts to Wait/Notify

After some playing around with V6 it turns out that the busy waits obscure the real hotspots of the application a bit when profiling. Plus, the fan on the system keeps going into overdrive even if no work items are being processed. So a further improvement was to busy wait for work items for a fixed amount of time (say, about 2 milliseconds) and then to revert to a "nicer" wait()/notify() combination. The worker threads simply publish their current wait mode to the main thread via an atomic boolean that indicates whether they are busy waiting (and hence just need a work item to be set) or whether they expect a call to notify() because they are in wait().
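One way the spin-then-wait hand-off could be structured is sketched below. It is only an illustration: the class HybridSlot, its single-item design and the assumption of one producer (the main thread) and one consumer (a worker) per slot are mine, not the original implementation. The parked flag plays the role of the atomic boolean described above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical single-item hand-off: spin briefly, then fall back to wait()/notify(). */
final class HybridSlot<T> {
    private static final long SPIN_NANOS = 2_000_000L; // about 2 ms of busy waiting

    private final AtomicReference<T> item = new AtomicReference<>();
    // true while the worker is blocked in wait() or about to block
    private final AtomicBoolean parked = new AtomicBoolean(false);
    private final Object lock = new Object();

    /** Called by the main thread. Assumes the previous item has already been taken. */
    void put(T work) {
        item.set(work);
        if (parked.get()) {
            // The worker has left busy-wait mode, so it needs an explicit wake-up.
            synchronized (lock) {
                lock.notify();
            }
        }
    }

    /** Called by the worker thread. */
    T take() throws InterruptedException {
        long deadline = System.nanoTime() + SPIN_NANOS;
        while (System.nanoTime() - deadline < 0) {
            T w = item.getAndSet(null);
            if (w != null) return w; // got work while spinning
        }
        synchronized (lock) {
            parked.set(true); // publish the wait mode before re-checking the slot
            try {
                T w;
                while ((w = item.getAndSet(null)) == null) {
                    lock.wait();
                }
                return w;
            } finally {
                parked.set(false);
            }
        }
    }
}
```

The worker sets the flag before its final check of the slot, and the main thread sets the item before reading the flag, so at least one side always notices the other and a wake-up cannot be lost.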

Another improvement that turned out to be rather straightforward was to let threads that have completed their primary work item repeatedly invoke a client-supplied callback while they wait for the other threads to complete theirs. That way, the wait time (which occurs because threads are bound to get slightly different workloads) is not completely lost to the app.
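The callback idea can be sketched in a few lines. The names IdleWork and awaitWithCallback are hypothetical, and the sketch assumes completion is tracked with a shared counter as in the earlier versions.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: do useful side work while waiting for the other threads. */
final class IdleWork {

    /**
     * Invokes idleCallback repeatedly until done reaches expected.
     * Returns how many times the callback was invoked.
     */
    static long awaitWithCallback(AtomicInteger done, int expected, Runnable idleCallback) {
        long calls = 0;
        while (done.get() < expected) {
            idleCallback.run(); // client-supplied work that fills the wait time
            calls++;
        }
        return calls;
    }
}
```

The callback should be short, because the waiting thread only notices that the others have finished between two invocations.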

I am still very interested in hearing from other users that encountered a similar use case.

Just hit upon this thread, and even though it's almost a year old let me point you to the "jbarrier" library we developed at the University of Bonn a few months ago:

http://net.cs.uni-bonn.de/wg/cs/applications/jbarrier/

The barrier package targets exactly the case where the number of worker threads is <= the number of cores. The package is based on busy-wait and supports not only barrier actions but also global reductions. Apart from a central barrier, it offers tree-structured barriers for parallelizing the synchronization/reduction parts even further.
