What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

执笔经年 2021-02-02 15:45

Which Java synchronization construct is likely to provide the best performance for a concurrent, iterative processing scenario with a fixed number of threads like the one outli

6 Answers
  • 2021-02-02 16:07

    Update: V6 - Busy Wait, with main thread also working

    An obvious improvement on V5 (busy wait for work in 7 worker threads, busy wait for completion in the main thread) was to split the work into 7+1 parts again and let the main thread process one part concurrently with the worker threads instead of just busy-waiting, and only then busy-wait for the other threads to complete their work items. This puts the 8th processor (in the example's 8-core configuration) to work and adds its cycles to the available compute pool.

    This was indeed straightforward to implement, and the results are again slightly better:

    blocksize | system | user | cycles/sec
    256k        1.0%     98%       1.39
    64k         1.0%     98%       6.8
    16k         1.0%     98%      50.4
    4096        1.0%     98%     372
    1024        1.0%     98%    1317
    256         1.0%     98%    3546
    64          1.5%     98%    9091
    16          2.0%     98%   16949
    

    So this seems to represent the best solution so far.
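
    A minimal sketch of the V6 idea, not the code that was benchmarked: the class name `MainThreadAlsoWorks`, the helper `sliceStart`, and the stand-in workload in `process` (incrementing an `int[]`) are all assumptions made for illustration. The work is split into `parts` slices, `parts - 1` helper threads spin on a cycle counter, and the main thread processes the last slice itself before spinning on the completion counter. `Thread.onSpinWait()` requires Java 9 or later and is only a hint to the CPU.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class MainThreadAlsoWorks {
    // Hypothetical stand-in for the real per-block work: increments [from, to).
    static void process(int[] data, int from, int to) {
        for (int i = from; i < to; i++) data[i]++;
    }

    // Start index of slice `index` when `length` elements are split into `parts` slices.
    static int sliceStart(int length, int index, int parts) {
        return (int) ((long) length * index / parts);
    }

    // Runs `cycles` iterations over `data` and returns the sum of all elements.
    static long run(int parts, int cycles, int[] data) throws InterruptedException {
        final int helpers = parts - 1;                    // the main thread takes the last slice
        final AtomicInteger cycle = new AtomicInteger(0); // published by main; -1 means "stop"
        final AtomicInteger done = new AtomicInteger(0);  // helpers finished in the current cycle
        Thread[] threads = new Thread[helpers];
        for (int p = 0; p < helpers; p++) {
            final int from = sliceStart(data.length, p, parts);
            final int to = sliceStart(data.length, p + 1, parts);
            threads[p] = new Thread(() -> {
                int seen = 0;
                while (true) {
                    int current;
                    // busy-wait until main publishes a new cycle number
                    while ((current = cycle.get()) == seen) Thread.onSpinWait();
                    if (current < 0) return;
                    seen = current;
                    process(data, from, to);
                    done.incrementAndGet();               // signal completion to main
                }
            });
            threads[p].start();
        }
        final int mainFrom = sliceStart(data.length, helpers, parts);
        for (int c = 1; c <= cycles; c++) {
            done.set(0);                                  // reset before releasing the helpers
            cycle.set(c);                                 // release the helpers
            process(data, mainFrom, data.length);         // main works instead of idling
            while (done.get() < helpers) Thread.onSpinWait();
        }
        cycle.set(-1);
        for (Thread t : threads) t.join();
        long sum = 0;
        for (int v : data) sum += v;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 100, new int[4096]));
    }
}
```

    The completion counter is reset before the new cycle number is published, so no helper can increment it for the new cycle before the reset has happened.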

  • 2021-02-02 16:07

    Just hit upon this thread, and even though it's almost a year old let me point you to the "jbarrier" library we developed at the University of Bonn a few months ago:

    http://net.cs.uni-bonn.de/wg/cs/applications/jbarrier/

    The barrier package targets exactly the case where the number of worker threads is <= the number of cores. The package is based on busy-wait, it supports not only barrier actions but also global reductions, and apart from a central barrier it offers tree-structured barriers for parallelizing the synchronization/reduction parts even further.

  • 2021-02-02 16:09

    I also wonder whether you could try more than 8 threads. If your CPU supports Hyper-Threading then (at least in theory) you can run 2 threads per core and see what comes out of it.

  • 2021-02-02 16:14

    It seems that you do not need any synchronization between the workers. You might consider the Fork/Join framework, which is part of Java 7 and is also available as a separate library. Some links:

    • Tutorial at Oracle
    • Original paper by Doug Lea
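
    As a rough illustration of what the Fork/Join suggestion could look like for a cyclic workload, here is a sketch under assumptions: the class names, the `THRESHOLD` value, and the stand-in work (incrementing an `int[]`) are hypothetical and not taken from the question. Each cycle submits one `RecursiveAction` that splits the range recursively, and `ForkJoinPool.invoke` returns only when the whole cycle is complete, so it acts as the per-cycle barrier.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

final class ForkJoinCycles {
    // Recursively splits [from, to) until a slice is small enough to process directly.
    static final class IncrementTask extends RecursiveAction {
        private static final int THRESHOLD = 1024; // assumed cut-off, tune per workload
        private final int[] data;
        private final int from;
        private final int to;

        IncrementTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                for (int i = from; i < to; i++) data[i]++; // stand-in for the real work
                return;
            }
            int mid = (from + to) >>> 1;
            // fork both halves and wait for both to finish
            invokeAll(new IncrementTask(data, from, mid), new IncrementTask(data, mid, to));
        }
    }

    // Runs `cycles` iterations over `data` and returns the sum of all elements.
    static long run(int cycles, int[] data) {
        ForkJoinPool pool = new ForkJoinPool(); // defaults to one worker per available processor
        try {
            for (int c = 0; c < cycles; c++) {
                pool.invoke(new IncrementTask(data, 0, data.length)); // blocks until the cycle is done
            }
        } finally {
            pool.shutdown();
        }
        long sum = 0;
        for (int v : data) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run(100, new int[1 << 16]));
    }
}
```

    Whether this beats a hand-rolled busy wait for very small block sizes would have to be measured; the pool parks idle workers rather than spinning indefinitely.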
  • 2021-02-02 16:16

    Update: V7 - Busy Wait that reverts to Wait/Notify

    After some playing around with V6 it turns out that the busy waits obscure the real hotspots of the application a bit when profiling. Plus, the fan on the system keeps going into overdrive even if no work items are being processed. So a further improvement was to busy wait for work items for a fixed amount of time (say, about 2 milliseconds) and then to revert to a "nicer" wait()/notify() combination. The worker threads simply publish their current wait mode to the main thread via an atomic boolean that indicates whether they are busy waiting (and hence just need a work item to be set) or whether they expect a call to notify() because they are in wait().
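
    One possible shape for such a hybrid wait, as a sketch under assumptions rather than the actual V7 code: the class `HybridWait`, its methods `put` and `take`, and the demo are hypothetical names, and the sketch assumes one producer, one consumer, and at most one outstanding work item per worker. The worker spins for `spinNanos`, then sets the `parked` flag and falls back to `wait()`; the producer only pays for the monitor and `notify()` when that flag is set.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class HybridWait {
    private final AtomicReference<Runnable> item = new AtomicReference<>();
    private final AtomicBoolean parked = new AtomicBoolean(false); // true while the worker may be in wait()
    private final Object lock = new Object();
    private final long spinNanos;

    HybridWait(long spinNanos) {
        this.spinNanos = spinNanos;
    }

    // Called by the main thread: hand over a work item, notify only if the worker is parked.
    void put(Runnable work) {
        item.set(work);
        if (parked.get()) {
            synchronized (lock) {
                lock.notify();
            }
        }
    }

    // Called by the worker: spin for a bounded time, then fall back to wait()/notify().
    Runnable take() throws InterruptedException {
        final long deadline = System.nanoTime() + spinNanos;
        while (true) {
            Runnable work = item.get();
            if (work != null && item.compareAndSet(work, null)) return work;
            if (System.nanoTime() - deadline < 0) {
                Thread.onSpinWait(); // Java 9+, a hint only
                continue;
            }
            synchronized (lock) {
                parked.set(true);                         // publish the wait mode first...
                while (item.get() == null) lock.wait();   // ...then re-check before sleeping
                parked.set(false);
            }
        }
    }

    // Demo: every second item arrives after a pause longer than the spin budget.
    static int demo() throws InterruptedException {
        HybridWait slot = new HybridWait(2_000_000L); // about 2 ms of spinning
        AtomicInteger ran = new AtomicInteger();
        Thread worker = new Thread(() -> {
            try {
                for (int i = 0; i < 6; i++) slot.take().run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        for (int i = 1; i <= 6; i++) {
            if (i % 2 == 0) Thread.sleep(20);          // long gap: worker has reverted to wait()
            slot.put(ran::incrementAndGet);
            while (ran.get() < i) Thread.onSpinWait(); // keep one item outstanding at a time
        }
        worker.join();
        return ran.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

    The worker sets `parked` before re-checking the slot, and the producer sets the slot before reading `parked`, so at least one side always observes the other and a wake-up cannot be lost.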

    Another improvement that turned out to be rather straightforward was to let threads that have completed their primary work item repeatedly invoke a client-supplied callback while they wait for the other threads to complete theirs. That way, the wait time (which occurs because threads are bound to get slightly different workloads) is not completely lost to the application.
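
    The callback idea can be sketched in a few lines. The names `IdleCallback` and `awaitOthers` are hypothetical, and the sketch assumes completion is tracked with a shared `AtomicInteger` counter; the callback should be short, since the wait cannot end while it is running.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class IdleCallback {
    // Spins until `done` reaches `total`, running `idleWork` on every pass
    // instead of burning the cycles. Returns how often the callback ran.
    static int awaitOthers(AtomicInteger done, int total, Runnable idleWork) {
        int calls = 0;
        while (done.get() < total) {
            idleWork.run();
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        AtomicInteger done = new AtomicInteger(0);
        // For the demo the callback itself advances the counter, so the loop
        // ends after exactly three calls.
        System.out.println(awaitOthers(done, 3, done::incrementAndGet)); // prints 3
    }
}
```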

    I am still very interested in hearing from other users that encountered a similar use case.

  • 2021-02-02 16:28

    Update: V5 - Busy Wait in all threads (seems optimal so far)

    Since all cores are dedicated to this task, it seemed worth a try to simply eliminate all the complex synchronization constructs and do a busy wait at each synchronization point in all threads. This turns out to beat all other approaches by a wide margin.

    The setup is as follows: start with V4 above (CyclicBarrier + Busy Wait). Replace the CyclicBarrier with an AtomicInteger that the main thread resets to zero each cycle. Each worker thread Runnable that completes its work increments the atomic integer by one. The main thread busy waits:

    while( true ) {
        // busy-wait for threads to complete their work
        if( atomicInt.get() >= workerThreadCount ) break;
    }
    

    Instead of 8, only 7 worker threads are launched (since all threads, including the main thread, now load a core pretty much completely). The results are as follows:

    blocksize | system | user | cycles/sec
    256k        1.0%     98%       1.36
    64k         1.0%     98%       6.8
    16k         1.0%     98%      44.6
    4096        1.0%     98%     354
    1024        1.0%     98%    1189
    256         1.0%     98%    3222
    64          1.5%     98%    8333
    16          2.0%     98%   16129
    

    Using wait/notify in the worker threads reduces the throughput to about one third of this solution.
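
    For reference, the whole V5 setup can be sketched end to end as follows. This is an illustration under assumptions, not the benchmarked code: the class name `BusyWaitCycles`, the cycle counter used to release the workers, and the stand-in workload (incrementing a slice of an `int[]`) are hypothetical. `Thread.onSpinWait()` requires Java 9 or later and can simply be dropped on older versions.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BusyWaitCycles {
    // Runs `cycles` iterations over `data` with `workerCount` spinning workers
    // and returns the sum of all elements afterwards.
    static long run(int workerCount, int cycles, int[] data) throws InterruptedException {
        final AtomicInteger cycle = new AtomicInteger(0); // published by main; -1 means "stop"
        final AtomicInteger done = new AtomicInteger(0);  // the counter main resets each cycle
        Thread[] workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++) {
            final int from = (int) ((long) data.length * w / workerCount);
            final int to = (int) ((long) data.length * (w + 1) / workerCount);
            workers[w] = new Thread(() -> {
                int seen = 0;
                while (true) {
                    int current;
                    // busy-wait for the next work item (here: a new cycle number)
                    while ((current = cycle.get()) == seen) Thread.onSpinWait();
                    if (current < 0) return;
                    seen = current;
                    for (int i = from; i < to; i++) data[i]++; // stand-in for the real work
                    done.incrementAndGet();                    // report completion
                }
            });
            workers[w].start();
        }
        for (int c = 1; c <= cycles; c++) {
            done.set(0);   // reset the counter before releasing the workers
            cycle.set(c);  // release the workers
            // busy-wait for all workers to complete this cycle
            while (done.get() < workerCount) Thread.onSpinWait();
        }
        cycle.set(-1);
        for (Thread t : workers) t.join();
        long sum = 0;
        for (int v : data) sum += v;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        // 7 workers plus the spinning main thread, as in the 8-core setup above
        System.out.println(run(7, 100, new int[7000])); // prints 700000
    }
}
```

    Because every worker writes its slice before incrementing the atomic counter, and the main thread reads that counter before touching the data, the results of a cycle are visible to the main thread without any further locking.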
