Java - multithreaded code does not run faster on more cores

Backend · unresolved · 5 answers · 1644 views

Asked by 半阙折子戏 on 2021-02-10 22:59

I was just running some multithreaded code on a 4-core machine in the hopes that it would be faster than on a single-core machine. Here's the idea: I got a fixed number of thre

5 answers
  • 2021-02-10 23:07

    Here is an untested SpinBarrier, but it should work.

    Check whether it makes any difference in your case. Since you run the code in a loop, the extra synchronization only hurts performance if it leaves cores idle. By the way, I still believe you have a bug in the calculation, a memory-intensive operation. Can you tell us what CPU and OS you are using?

    Edit: I had left the version field out of the first version.

    import java.util.concurrent.atomic.AtomicInteger;

    public class SpinBarrier {
        final int permits;
        final AtomicInteger count;
        final AtomicInteger version;

        public SpinBarrier(int count) {
            this.count = new AtomicInteger(count);
            this.permits = count;
            this.version = new AtomicInteger();
        }

        public void await() {
            // Read the current generation BEFORE arriving: reading it after
            // the decrement can miss the trip and leave a thread spinning forever.
            final int v = version.get();
            for (int c = count.decrementAndGet(); c != 0 && v == version.get(); c = count.get()) {
                spinWait();
            }
            if (count.compareAndSet(0, permits)) { // only one succeeds here, the rest will lose the CAS
                this.version.incrementAndGet();
            }
        }

        protected void spinWait() {
        }
    }
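    A quick usage sketch (the demo class, thread count, and round count are my own, not from the answer; the barrier class is repeated as a nested class so the snippet compiles standalone): four threads do a trivial piece of work, then meet at the barrier each round.

    ```java
    import java.util.concurrent.atomic.AtomicInteger;

    public class SpinBarrierDemo {
        // The barrier from above, repeated here so the demo compiles on its own.
        static class SpinBarrier {
            final int permits;
            final AtomicInteger count;
            final AtomicInteger version = new AtomicInteger();

            SpinBarrier(int count) {
                this.count = new AtomicInteger(count);
                this.permits = count;
            }

            void await() {
                final int v = version.get(); // snapshot the generation before arriving
                for (int c = count.decrementAndGet(); c != 0 && v == version.get(); c = count.get()) {
                    Thread.onSpinWait();
                }
                if (count.compareAndSet(0, permits)) { // one winner resets and trips the barrier
                    version.incrementAndGet();
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            final int parties = 4;
            final int rounds = 1000;
            final SpinBarrier barrier = new SpinBarrier(parties);
            final AtomicInteger sum = new AtomicInteger();

            Thread[] workers = new Thread[parties];
            for (int t = 0; t < parties; t++) {
                workers[t] = new Thread(() -> {
                    for (int i = 0; i < rounds; i++) {
                        sum.incrementAndGet(); // stand-in for the real per-round work
                        barrier.await();       // wait for the other three threads
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.println(sum.get()); // parties * rounds = 4000
        }
    }
    ```

    Note `Thread.onSpinWait()` (Java 9+) as one possible body for the spin; on older JVMs an empty loop body works too, just less politely toward the CPU.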
    
  • 2021-02-10 23:09

    Synchronizing across cores is much slower than synchronizing on a single core,

    because on a single-core machine the JVM doesn't have to flush the cache (a very slow operation) during each sync.

    Check out this blog post.
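    As a rough illustration of that cost (my own micro-sketch, not from the answer; absolute numbers vary wildly with CPU, OS, and JIT warm-up), compare the same total number of lock-protected increments done by one thread versus four contending threads:

    ```java
    public class SyncCostSketch {
        static final Object lock = new Object();
        static long counter;

        static long timeIncrements(int threads, int perThread) throws InterruptedException {
            counter = 0;
            Thread[] ts = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                ts[i] = new Thread(() -> {
                    for (int j = 0; j < perThread; j++) {
                        synchronized (lock) { counter++; } // a sync point on every step
                    }
                });
                ts[i].start();
            }
            for (Thread t : ts) t.join();
            return System.nanoTime() - start;
        }

        public static void main(String[] args) throws InterruptedException {
            final int total = 1_000_000;
            long single = timeIncrements(1, total);
            System.out.println("1 thread : " + single / 1_000_000 + " ms, counter = " + counter);
            long multi = timeIncrements(4, total / 4);
            System.out.println("4 threads: " + multi / 1_000_000 + " ms, counter = " + counter);
            // Same total work, but on a multi-core machine the 4-thread run is
            // frequently slower, because the lock (and the counter's cache line)
            // has to ping-pong between cores on every increment.
        }
    }
    ```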

  • 2021-02-10 23:12

    The code inside the runnable does not actually do anything.
    In your specific example of 4 threads, each thread sleeps for 2.5 seconds and then waits for the others at the barrier.
    So all that happens is that each thread gets on the processor just long enough to increment i, then blocks in sleep, leaving the processor available.
    I do not see why the scheduler would allocate each thread to a separate core, since the threads mostly just wait.
    It is fair and reasonable for it to use the same core and switch among the threads.
    UPDATE
    I just saw that you updated the post saying that some work happens in the loop, but you do not say what that work is.

  • 2021-02-10 23:14

    You're sleeping for nanoseconds instead of milliseconds.

    I changed from

    Thread.sleep(0, 100000 / numberOfThreads); // sleep 0.025 ms for 4 threads
    

    to

    Thread.sleep(100000 / numberOfThreads);
    

    and got a speed-up proportional to the number of threads started just as expected.
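    The difference comes from the two overloads of Thread.sleep: sleep(long millis) versus sleep(long millis, int nanos). A tiny sketch of the arithmetic (the thread count matches the answer; the printout is mine):

    ```java
    public class SleepUnits {
        public static void main(String[] args) throws InterruptedException {
            int numberOfThreads = 4;

            // Two-argument overload: Thread.sleep(millis, nanos).
            // Thread.sleep(0, 100000 / 4) requests 0 ms + 25,000 ns = 0.025 ms.
            int nanos = 100000 / numberOfThreads;
            System.out.println(nanos + " ns = " + nanos / 1_000_000.0 + " ms");

            // One-argument overload: Thread.sleep(millis).
            // Thread.sleep(100000 / 4) requests 25,000 ms = 25 s, a million times longer.
            int millis = 100000 / numberOfThreads;
            System.out.println(millis + " ms = " + millis / 1000.0 + " s");
        }
    }
    ```

    So the original code barely slept at all, and the threads spent essentially their whole lives contending at the barrier.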


    I invented a CPU-intensive "countPrimes". Full test code available here.

    I get the following speed-up on my quad-core machine:

    4 threads: 1625
    1 thread: 3747
    

    (The CPU-load monitor indeed shows that 4 cores are busy in the former case, and that 1 core is busy in the latter.)

    Conclusion: you're doing comparatively small portions of work in each thread between synchronization points. The synchronization takes much, much more time than the actual CPU-intensive computation.

    (Also, if you have memory-intensive code, such as tons of array accesses in the threads, the CPU won't be the bottleneck anyway, and you won't see any speed-up by splitting the work across multiple CPUs.)

  • 2021-02-10 23:32

    Adding more threads is not guaranteed to improve performance. There are a number of possible causes for decreased performance with additional threads:

    • Coarse-grained locking may overly serialize execution - that is, a lock may result in only one thread running at a time. You get all the overhead of multiple threads but none of the benefits. Try to reduce how long locks are held.
    • The same applies to overly frequent barriers and other synchronization structures. If the inner j loop completes quickly, you might spend most of your time in the barrier. Try to do more work between synchronization points.
    • If your code runs too quickly, there may be no time to migrate threads to other CPU cores. This usually isn't a problem unless you create a lot of very short-lived threads. Using thread pools, or simply giving each thread more work can help. If your threads run for more than a second or so each, this is unlikely to be a problem.
    • If your threads are working on a lot of shared read/write data, cache line bouncing may decrease performance. That said, although this often results in performance degradation, this alone is unlikely to result in performance worse than the single threaded case. Try to make sure the data that each thread writes is separated from other threads' data by the size of a cache line (usually around 64 bytes). In particular, don't have output arrays laid out like [thread A, B, C, D, A, B, C, D ...]
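    To make the last bullet concrete, here is a hedged sketch (my own, not the OP's code) of the interleaved layout it warns against versus cache-line-padded per-thread slots. The 64-byte line size and iteration count are assumptions; the sums verify correctness, while the interesting part is the wall-clock difference you typically see between the two layouts:

    ```java
    public class FalseSharingSketch {
        static final int THREADS = 4;
        static final int PAD = 8;              // 8 longs = 64 bytes, about one cache line
        static final int ITERS = 10_000_000;

        static long[] run(long[] cells, int stride) throws InterruptedException {
            Thread[] ts = new Thread[THREADS];
            for (int t = 0; t < THREADS; t++) {
                final int idx = t * stride;
                ts[t] = new Thread(() -> {
                    for (int i = 0; i < ITERS; i++) {
                        cells[idx]++;          // each thread writes only its own slot
                    }
                });
                ts[t].start();
            }
            for (Thread th : ts) th.join();
            return cells;
        }

        public static void main(String[] args) throws InterruptedException {
            // The layout the bullet warns against: [A, B, C, D] in adjacent slots,
            // so all four threads keep invalidating the same cache line.
            long[] adjacent = run(new long[THREADS], 1);

            // Padded layout: each thread's slot sits a full cache line apart.
            long[] padded = run(new long[THREADS * PAD], PAD);

            long a = 0, p = 0;
            for (long v : adjacent) a += v;
            for (long v : padded) p += v;
            System.out.println("adjacent total = " + a); // THREADS * ITERS
            System.out.println("padded total   = " + p); // same work, usually faster wall-clock
        }
    }
    ```

    (A caveat in the same hedged spirit: the JIT is free to optimize plain array writes, so for real measurements a proper harness such as JMH is the way to go; this only sketches the layout idea.)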

    Since you haven't shown your code, I can't really speak in any more detail here.
