I've had this question for quite a while now, trying to read lots of resources and understand what is going on - but I've still failed to get a good understanding of why th
First of all, the code you are writing is Java, which is compiled to Java bytecode, which in turn translates to different atomic instructions on different instruction sets (ARM vs. PowerPC vs. x86, ...). These can behave differently across implementations by different vendors, and even between architectures from the same vendor (such as Intel Core 2 Duo and Skylake). So it's really hard to answer your question!
This paper states that, for the tested x86 architectures, a single execution of any atomic operation performs similarly (there is very little difference between CAS, fetch-and-add, and swap), while CAS may fail and need to be executed multiple times. With one thread, however, it never fails.
This stackoverflow post states:
Each object has a monitor associated with it. The thread that executes monitorenter gains ownership of the monitor associated with objectref. If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked, then tries again to gain ownership. If the current thread already owns the monitor associated with objectref, it increments a counter in the monitor indicating the number of times this thread has entered the monitor. If the monitor associated with objectref is not owned by any thread, the current thread becomes the owner of the monitor, setting the entry count of this monitor to 1.
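For intuition, here is a rough Java-level sketch of the reentrancy rules the quote describes (an owner, an entry count, retry when someone else holds the monitor). It is only an illustration with made-up names, not how HotSpot actually represents monitors:
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the monitorenter/monitorexit rules quoted above, NOT the real implementation.
final class MonitorSketch {
    private final AtomicReference<Thread> owner = new AtomicReference<>();
    private int entryCount; // only touched by the owning thread

    void enter() {
        Thread current = Thread.currentThread();
        if (owner.get() == current) {        // already the owner: bump the counter
            entryCount++;
            return;
        }
        while (!owner.compareAndSet(null, current)) {
            Thread.onSpinWait();             // "waits until the object is unlocked, then tries again" (Java 9+ hint)
        }
        entryCount = 1;                      // became the owner, entry count set to 1
    }

    void exit() {
        if (--entryCount == 0) {
            owner.set(null);                 // fully released
        }
    }
}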
Let's look at the necessary operations in the CAS case:
public final int updateAndGet(IntUnaryOperator updateFunction) {
    int prev, next;
    do {
        prev = get();
        next = updateFunction.applyAsInt(prev);
    } while (!compareAndSet(prev, next));
    return next;
}
Fetch x, multiply x, CAS on x, check whether the CAS succeeded.
Now, this is efficient in the non-contended case because the number of operations needed is minimal. But when the cache line is contended it is not very efficient, because all threads are actively spinning while most of them fail. Furthermore, I recall that spinning on a contended cache line with an atomic operation is very expensive.
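As an illustration only (the thread and iteration counts are arbitrary), here is a minimal sketch of that contended scenario, with several threads spinning on the same AtomicLong:
import java.util.concurrent.atomic.AtomicLong;

// Several threads hammer one AtomicLong; most CAS attempts fail and are retried internally.
public class ContendedCasDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong shared = new AtomicLong(1L);
        Runnable task = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                shared.updateAndGet(x -> x * 123L); // retried on CAS failure
            }
        };
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(task);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(shared.get());
    }
}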
Now, the important part of synchronized is:
If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked
It depends on how this waiting is implemented.
The synchronized implementation could put the thread to sleep for a random time after failing to acquire the monitor. Also, instead of using an atomic operation to check whether the monitor is free, it can do so with a simple read (this is quicker, but I couldn't find a link to prove it).
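To make the "simple read first" idea concrete, here is a minimal sketch of a generic test-and-test-and-set spin lock. It is not the JVM's monitor implementation, and the class name is made up:
import java.util.concurrent.atomic.AtomicBoolean;

// Test-and-test-and-set: read cheaply while the lock is held, pay for the atomic operation
// only when the lock looks free.
final class TtasLock {
    private final AtomicBoolean locked = new AtomicBoolean();

    void lock() {
        for (;;) {
            while (locked.get()) {                   // simple read, no atomic read-modify-write
                Thread.onSpinWait();                 // (Java 9+ hint) could also sleep a short random time
            }
            if (locked.compareAndSet(false, true)) { // atomic operation only when it looks free
                return;
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}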
My bet is that the waiting in synchronized is implemented in a smart way and optimized for contended situations, using one of the methods above or something similar, and that is why it is quicker in the contended scenario.
The trade-off is that it is slower in non-contended situations.
I still admit I have no proof.
You should read, re-read, and accept @Holger's excellent answer, as the insights it provides are far more valuable than a single set of benchmark numbers from one developer's workstation.
I tweaked your benchmarks to make them a bit more apples-to-apples, but if you read @Holger's answer, you'll see why this isn't a terribly useful test. I'm going to include my changes and my results simply to show how results can vary from one machine (or one JRE version) to another.
First, my version of the benchmarks:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

@State(Scope.Benchmark)
public class SandBox {
    public static void main(String[] args) throws RunnerException {
        new Runner(
            new OptionsBuilder().include(SandBox.class.getSimpleName())
                                .shouldFailOnError(true)
                                .mode(Mode.AverageTime)
                                .timeUnit(TimeUnit.NANOSECONDS)
                                .warmupIterations(5)
                                .warmupTime(TimeValue.seconds(5))
                                .measurementIterations(5)
                                .measurementTime(TimeValue.seconds(5))
                                .threads(-1)
                                .build()
        ).run();
    }

    private long number = 0xCAFEBABECAFED00DL;
    private final Object lock = new Object();
    private final AtomicLong atomicNumber = new AtomicLong(number);

    @Setup(Level.Iteration)
    public void setUp() {
        // reset the shared state before each iteration
        number = 0xCAFEBABECAFED00DL;
        atomicNumber.set(number);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long casShared() {
        return atomicNumber.updateAndGet(x -> x * 123L);
    }

    @Fork(1)
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncShared() {
        synchronized (lock) {
            return number *= 123L;
        }
    }

    @Fork(value = 1, jvmArgsAppend = "-XX:-UseBiasedLocking")
    @Benchmark
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public long syncSharedNonBiased() {
        synchronized (lock) {
            return number *= 123L;
        }
    }
}
And then my first batch of results:
# VM version: JDK 1.8.0_60, VM 25.60-b23
Benchmark Mode Cnt Score Error Units
SandBox.casShared avgt 5 976.215 ± 167.865 ns/op
SandBox.syncShared avgt 5 1820.554 ± 91.883 ns/op
SandBox.syncSharedNonBiased avgt 5 1996.305 ± 124.681 ns/op
Recall that you saw synchronized coming out ahead under high contention. On my workstation, the atomic version fared better. If you use my version of your benchmarks, what results do you see? It won't surprise me in the least if they're substantially different.
Here's another set run under a months-old Java 9 EA release:
# VM version: JDK 9-ea, VM 9-ea+170
Benchmark Mode Cnt Score Error Units
SandBox.casShared avgt 5 979.615 ± 135.495 ns/op
SandBox.syncShared avgt 5 1426.042 ± 52.971 ns/op
SandBox.syncSharedNonBiased avgt 5 1649.868 ± 48.410 ns/op
The difference is less dramatic here. It's not terribly unusual to see a difference across major JRE versions, but who's to say you won't see them across minor releases too?
At the end of the day, the results are close. Very close. The performance of synchronized has come a long way since the early Java versions. If you are not writing HFT algorithms or something else that's incredibly latency sensitive, you should prefer the solution that's most easily proven correct. It is generally easier to reason about synchronized than lock-free algorithms and data structures. If you cannot demonstrate a measurable difference in your application, then synchronized is what you should use.
The main misconception is the assumption that you are comparing “CAS vs. synchronized”. Given how sophisticated JVMs implement synchronized, you are comparing the performance of a CAS-based algorithm using AtomicLong with the performance of the CAS-based algorithm used to implement synchronized.
Similar to Lock, the internal information for an object monitor basically consists of an int status telling whether it has been owned and how often it is nested, a reference to the current owner thread, and a queue of threads waiting to be able to acquire it. The expensive aspect is the waiting queue. Putting a thread into the queue, removing it from thread scheduling, and eventually waking it up when the current owner releases the monitor are operations that can take significant time.
However, in the uncontended case, the waiting queue is, of course, not involved. Acquiring the monitor consists of a single CAS to change the status from “unowned” (usually zero) to “owned, acquired once” (guess the typical value). If successful, the thread can proceed with the critical action, followed by a release, which implies just writing the “unowned” state with the necessary memory visibility and waking up another blocked thread, if there is one.
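A minimal sketch of that uncontended fast path, modelling only the status word (class and method names are invented for illustration; a real monitor carries more state, such as the owner and the queue):
import java.util.concurrent.atomic.AtomicInteger;

// Uncontended fast path: one CAS to acquire, one plain volatile write to release.
final class FastPathSketch {
    private final AtomicInteger status = new AtomicInteger(); // 0 = unowned

    boolean tryAcquire() {
        return status.compareAndSet(0, 1);  // "unowned" -> "owned, acquired once"
    }

    void release() {
        status.set(0);                      // volatile write provides the required visibility
    }
}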
Since the wait queue is the significantly more expensive thing, implementations usually try to avoid it even in the contended case by performing some amount of spinning, making several repeated CAS attempts before falling back to enqueuing the thread. If the critical action of the owner is as simple as a single multiplication, chances are high that the monitor will be released during the spinning phase already. Note that synchronized is “unfair”, allowing a spinning thread to proceed immediately, even if there are already enqueued threads waiting far longer.
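The following sketch illustrates the spin-then-enqueue strategy described above, under assumed simplifications (an arbitrary spin count, a simple queue, made-up names). It is not what HotSpot actually does, but it shows why the enqueue/park path is so much more expensive than the spinning path, and why late arrivals can win over parked threads:
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Spin a bounded number of times, then enqueue and park until woken by the releaser.
final class SpinThenParkLock {
    private final AtomicInteger status = new AtomicInteger();            // 0 = unowned
    private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

    void lock() {
        for (int i = 0; i < 100; i++) {                                  // cheap spinning phase
            if (status.compareAndSet(0, 1)) {
                return;                                                  // acquired while spinning
            }
            Thread.onSpinWait();
        }
        Thread current = Thread.currentThread();
        waiters.add(current);                                            // expensive path: enqueue
        while (!status.compareAndSet(0, 1)) {
            LockSupport.park(this);                                      // removed from scheduling
        }
        waiters.remove(current);
    }

    void unlock() {
        status.set(0);                                                   // release
        Thread next = waiters.peek();
        if (next != null) {
            LockSupport.unpark(next);                                    // wake one blocked thread;
        }                                                                // a spinning thread may still win (unfair)
    }
}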
If you compare the fundamental operations performed by synchronized(lock){ n = n * 123; } when no queuing is involved and by al.updateAndGet(x -> x * 123), you’ll notice that they are roughly on par. The main difference is that the AtomicLong approach will repeat the multiplication on contention, while for the synchronized approach there is a risk of being put into the queue if no progress has been made during spinning.
But synchronized allows lock coarsening for code repeatedly synchronizing on the same object, which might be relevant for a benchmark loop calling the syncShared method. Unless there’s also a way to fuse multiple CAS updates of an AtomicLong, this can give synchronized a dramatic advantage. (See also this article covering several aspects discussed above.)
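As a conceptual illustration only (not actual JIT output, with invented names), lock coarsening can make a loop that locks per iteration behave roughly like a version that locks once around the whole loop:
// Conceptual sketch of lock coarsening on a loop similar to the syncShared benchmark.
class CoarseningSketch {
    private final Object lock = new Object();
    private long number = 1L;

    long lockPerIteration() {               // what the source code says
        long result = 0;
        for (int i = 0; i < 1_000; i++) {
            synchronized (lock) {           // acquire and release every iteration
                result = number *= 123L;
            }
        }
        return result;
    }

    long afterCoarsening() {                // roughly how the optimized code may behave
        long result = 0;
        synchronized (lock) {               // one acquire covering the whole loop
            for (int i = 0; i < 1_000; i++) {
                result = number *= 123L;
            }
        }
        return result;
    }
}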
Note that due to the “unfair” nature of synchronized, creating far more threads than CPU cores doesn’t have to be a problem. In the best case, “number of threads minus number of cores” threads end up on the queue, never waking up, while the remaining threads succeed in the spinning phase, one thread on each core. But likewise, threads not running on a CPU core can’t reduce the performance of the AtomicLong update, as they can neither invalidate the current value for other threads nor make a failed CAS attempt.
In either case, when CASing on the member variable of an unshared object or when performing synchronized on an unshared object, the JVM may detect the local nature of the operation and elide most of the associated costs. But this may depend on several subtle environmental aspects.
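For example, in a sketch like the following (hypothetical methods, assuming escape analysis is enabled in the running JVM), neither the monitor nor the AtomicLong ever escapes the method, so the JVM may elide most of the cost:
import java.util.concurrent.atomic.AtomicLong;

// Both objects are purely local and never visible to another thread,
// so locking and atomic costs are candidates for elision.
class UnsharedSketch {
    long synchronizedOnLocal() {
        Object localLock = new Object();        // never escapes this method
        long n = 1L;
        synchronized (localLock) {              // candidate for lock elision
            n *= 123L;
        }
        return n;
    }

    long casOnLocal() {
        AtomicLong local = new AtomicLong(1L);  // never escapes this method
        return local.updateAndGet(x -> x * 123L);
    }
}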
The bottom line is that there is no easy decision between atomic updates and synchronized blocks. Things get far more interesting with more expensive operations, which may raise the likelihood of threads getting enqueued in the contended case of synchronized, which in turn may make it acceptable that the operation has to be repeated in the contended case of an atomic update.
Note that CAS can give you more fine-grained ordering (non-)guarantees than a synchronized block, especially with Java 9 VarHandles, which provide ordering options aligned with the C++11 memory model.
If all you want to do is some statistics-keeping from multiple threads, then a read-compute-update loop with the most relaxed memory orderings available (plain read; plain and weak CAS) may perform better on weakly ordered platforms, since it won't need any barriers and the CAS won't have to do wasteful internal looping if it's implemented on top of LL/SC. Additionally, it will give the JITs more freedom to reorder the instructions around those atomics. compareAndExchange can eliminate an additional read on loop repetition.
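A minimal sketch of such a relaxed read-compute-update loop, assuming Java 9+ VarHandles (the class and field names are made up; which access modes are appropriate depends on the guarantees you actually need):
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Statistics-keeping counter using relaxed VarHandle access modes.
class RelaxedCounter {
    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup().findVarHandle(RelaxedCounter.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    private volatile long value; // accessed below via weaker VarHandle modes

    void add(long delta) {
        long prev = (long) VALUE.getOpaque(this);   // relaxed read, no ordering fence
        for (;;) {
            // compareAndExchange returns the witness value, so a failed attempt
            // doubles as the re-read for the next iteration.
            long witness = (long) VALUE.compareAndExchange(this, prev, prev + delta);
            if (witness == prev) {
                return;                              // update succeeded
            }
            prev = witness;
            // For even weaker semantics, weakCompareAndSetPlain could be used instead,
            // but it only returns a boolean, so the value would need to be re-read.
        }
    }
}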
Another complication is how you measure performance. All of the implementations should have progress guarantees, i.e. even under contention at least one thread can finish at a time. So in principle you could be wasting CPU cycles on multiple threads attempting to update your variable concurrently, yet still do better on 99th-percentile latency because atomic operations won't resort to descheduling the thread, and do worse on worst-case latency because they're not fair. So just measuring averages might not tell the whole story here.
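One way to see those tail effects with JMH is to run the benchmarks in SampleTime mode, which reports a latency distribution (p0.50, p0.99, max) instead of a single average. A minimal sketch reusing the SandBox class from the benchmark answer above (assuming it is on the classpath in the same package):
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PercentileRun {
    public static void main(String[] args) throws RunnerException {
        new Runner(
            new OptionsBuilder().include(SandBox.class.getSimpleName())
                                .mode(Mode.SampleTime)        // per-invocation samples, percentile output
                                .timeUnit(TimeUnit.NANOSECONDS)
                                .threads(-1)                  // all available cores, to create contention
                                .build()
        ).run();
    }
}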