Java BlockingQueue latency high on Linux

终归单人心 2021-01-30 23:03

I am using BlockingQueues (trying both ArrayBlockingQueue and LinkedBlockingQueue) to pass objects between different threads in an application I'm currently working on. Perform…

4 Answers
  • 2021-01-30 23:11

    Your test is not a good measure of queue hand-off latency, because the single thread reading off the queue writes synchronously to System.out (doing a String and long concatenation while it is at it) before it takes again. To measure this properly you need to move that activity out of this thread and do as little work as possible in the taking thread.

    You'd be better off just doing the calculation (then - now) in the taker and adding the result to some other collection which is periodically drained by another thread that outputs the results. I tend to do this by adding to an appropriately presized, array-backed structure accessed via an AtomicReference: the reporting thread just has to getAndSet on that reference with another instance of the storage structure in order to grab the latest batch of results. For example, make 2 lists, set one as active, and every x seconds have a thread wake up and swap the active and passive ones. You can then report some distribution (e.g. a decile range) instead of every single result, which means you don't generate vast log files with every run and still get useful information printed for you.
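    The two-buffer swap described above can be sketched roughly like this (the class and field names here are my own invention, not from the original post):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the scheme described above: the taking thread
// appends (then - now) results to a presized, array-backed buffer held in
// an AtomicReference; a reporting thread periodically getAndSet()s a fresh
// buffer in and drains the one it took ownership of.
class LatencyRecorder {
    static final class Buffer {
        final long[] samples;
        final AtomicInteger count = new AtomicInteger();
        Buffer(int capacity) { samples = new long[capacity]; }
    }

    private final AtomicReference<Buffer> active;
    private final int capacity;

    LatencyRecorder(int capacity) {
        this.capacity = capacity;
        this.active = new AtomicReference<>(new Buffer(capacity));
    }

    // Hot path, called by the taking thread: no I/O, no allocation.
    void record(long elapsedNanos) {
        Buffer b = active.get();
        int i = b.count.getAndIncrement();
        if (i < b.samples.length) {
            b.samples[i] = elapsedNanos; // silently drops samples once full
        }
    }

    // Called by the reporting thread every x seconds: swap in an empty
    // buffer and summarize the one swapped out (e.g. as a decile range).
    Buffer drain() {
        return active.getAndSet(new Buffer(capacity));
    }
}
```

    A reporter thread would call drain() on a timer and summarize the returned samples, so the taker never touches System.out.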

    FWIW I concur with the times Peter Lawrey stated, and if latency is really critical then you need to think about busy waiting with appropriate CPU affinity (i.e. dedicate a core to that thread).

    EDIT after Jan 6

    If I remove the call to Thread.sleep () and instead let both the producer and consumer call barrier.await() in every iteration (the consumer calls it after having printed the elapsed time to the console), the measured latency is reduced from 60 microseconds to below 10 microseconds. If running the threads on the same core, the latency gets below 1 microsecond. Can anyone explain why this reduced the latency so significantly?

    You're looking at the difference between java.util.concurrent.locks.LockSupport#park (and the corresponding unpark) and Thread#sleep. Most j.u.c. stuff is built on LockSupport (often via an AbstractQueuedSynchronizer, as in ReentrantLock, or directly), and in HotSpot this resolves down to sun.misc.Unsafe#park (and unpark), which tends to end up in the hands of the pthread (POSIX threads) library: typically pthread_cond_broadcast to wake up, and pthread_cond_wait or pthread_cond_timedwait for things like BlockingQueue#take.

    I can't say I've ever looked at how Thread#sleep is actually implemented (because I've never come across something low-latency that isn't a condition-based wait), but I would imagine that it causes the thread to be demoted by the scheduler in a more aggressive way than the pthread signalling mechanism, and that is what accounts for the latency difference.
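    The parking primitive itself is easy to observe directly; a minimal sketch (the class name is mine):

```java
import java.util.concurrent.locks.LockSupport;

// Minimal demonstration of the LockSupport#park/unpark pair that
// j.u.c. blocking (and hence BlockingQueue#take) is built on.
public class ParkDemo {
    static volatile boolean woke = false;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            // Blocks until some other thread unparks us (on Linux/HotSpot
            // this ends up in the pthread condition-variable machinery).
            LockSupport.park();
            woke = true;
        });
        waiter.start();
        Thread.sleep(100);          // give the waiter time to reach park()
        LockSupport.unpark(waiter); // grant the permit; the waiter resumes
        waiter.join();
        System.out.println("woke = " + woke);
    }
}
```

    Note that unpark before park is also fine: park consumes the permit and returns immediately, which is what makes the primitive race-free for the blocking queues built on top of it.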

  • 2021-01-30 23:12

    @Peter Lawrey

    Certain operations use OS calls (such as locking/cyclic barriers)

    Those are NOT OS (kernel) calls; they are implemented via a simple CAS (which on x86 comes with a free memory fence as well).

    One more: don't use ArrayBlockingQueue unless you know why you are using it.

    @OP: look at ThreadPoolExecutor; it offers an excellent producer/consumer framework.

    Edit below:

    To reduce the latency (barring busy waiting), change the queue to a SynchronousQueue and add the following lines before starting the consumer:

    ...
    consumerThread.setPriority(Thread.MAX_PRIORITY);
    consumerThread.start();
    

    This is the best you can get.


    Edit 2: here it is with the synchronous queue, and without printing the results.

    package t1;
    
    import java.math.BigDecimal;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.SynchronousQueue;
    
    public class QueueTest {
    
        static final int RUNS = 250000;
    
        final SynchronousQueue<Long> queue = new SynchronousQueue<Long>();
    
        int sleep = 1000;
    
        long[] results  = new long[0];
        public void start(final int runs) throws Exception {
            results = new long[runs];
            final CountDownLatch barrier = new CountDownLatch(1);
            Thread consumerThread = new Thread(new Runnable() {
                @Override
                public void run() {
                    barrier.countDown();
                    try {
    
                        for(int i = 0; i < runs; i++) {                        
                            results[i] = consume(); 
    
                        }
                    } catch (Exception e) {
                        return;
                    } 
                }
            });
            consumerThread.setPriority(Thread.MAX_PRIORITY);
            consumerThread.start();
    
    
            barrier.await();
            final long sleep = this.sleep;
            for(int i = 0; i < runs; i++) {
                try {                
                    doProduce(sleep);
    
                } catch (Exception e) {
                    return;
                }
            }
        }
    
        private void doProduce(final long sleep) throws InterruptedException {
            produce();
        }
    
        public void produce() throws InterruptedException {
            queue.put(new Long(System.nanoTime())); // new Long() is faster than Long.valueOf here
        }
    
        public long consume() throws InterruptedException {
            long t = queue.take();
            long now = System.nanoTime();
            return now-t;
        }
    
        public static void main(String[] args) throws Throwable {           
            QueueTest test = new QueueTest();
            System.out.println("Starting + warming up...");
            // Run first once, ignoring results
            test.sleep = 0;
            test.start(15000); // 10k iterations is the normal warm-up for -server HotSpot
            // Run again, printing the results
            System.gc();
            System.out.println("Starting again...");
            test.sleep = 1000;//ignored now
            Thread.yield();
            test.start(RUNS);
            long sum = 0;
            for (long elapsed: test.results){
                sum+=elapsed;
            }
            BigDecimal elapsed = BigDecimal.valueOf(sum, 3).divide(BigDecimal.valueOf(test.results.length), BigDecimal.ROUND_HALF_UP);        
            System.out.printf("Avg: %1.3f micros%n", elapsed); 
        }
    }
    
  • 2021-01-30 23:12

    If latency is critical and you do not require strict FIFO semantics, then you may want to consider JSR-166's LinkedTransferQueue. It enables elimination so that opposing operations can exchange values instead of synchronizing on the queue data structure. This approach helps reduce contention, enables parallel exchanges, and avoids thread sleep/wake-up penalties.
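    As a rough sketch of that hand-off (LinkedTransferQueue was later folded into java.util.concurrent in Java 7; the class name TransferDemo is mine), transfer() passes the element straight to a waiting taker:

```java
import java.util.concurrent.LinkedTransferQueue;

// Sketch of the direct exchange: transfer() blocks until a consumer
// actually receives the element, rather than enqueueing and returning.
public class TransferDemo {
    static volatile long measuredNanos = -1;

    public static void main(String[] args) throws InterruptedException {
        LinkedTransferQueue<Long> queue = new LinkedTransferQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                long sent = queue.take();
                measuredNanos = System.nanoTime() - sent;
            } catch (InterruptedException ignored) {
            }
        });
        consumer.start();
        queue.transfer(System.nanoTime()); // returns only once the taker has it
        consumer.join();
        System.out.println("hand-off latency: " + measuredNanos + " ns");
    }
}
```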

  • 2021-01-30 23:26

    I would use just an ArrayBlockingQueue if you can. When I have used it, the latency was between 8 and 18 microseconds on Linux. Some points of note:

    • The cost is largely the time it takes to wake up the thread. When you wake up a thread its data/code won't be in cache, so you will find that if you time what happens after a thread has woken, it can take 2-5x longer than if you were to run the same thing repeatedly.
    • Certain operations use OS calls (such as locking/cyclic barriers); these are often more expensive in a low-latency scenario than busy waiting. I suggest busy waiting in your producer rather than using a CyclicBarrier. You could busy wait your consumer as well, but this could be unreasonably expensive on a real system.
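    Busy waiting the consumer as suggested above might look like this (a sketch under my own naming, not the poster's code; Thread.onSpinWait requires Java 9+):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch of a busy-waiting consumer: spin on poll() instead of blocking
// in take(), trading one pinned CPU core for the thread wake-up cost.
public class BusyWaitDemo {
    static volatile long measuredNanos = -1;

    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);
        Thread consumer = new Thread(() -> {
            Long sent;
            while ((sent = queue.poll()) == null) {
                Thread.onSpinWait(); // CPU hint that we are in a spin loop
            }
            measuredNanos = System.nanoTime() - sent;
        });
        consumer.start();
        Thread.sleep(100);              // let the consumer settle into its spin
        queue.offer(System.nanoTime());
        consumer.join();
        System.out.println("latency: " + measuredNanos + " ns");
    }
}
```

    On a real system you would pin the spinning thread to a dedicated core (e.g. via taskset or an affinity library), since the spin loop consumes that core entirely.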