I\'ve had this question for quite a while now, trying to read lots of resources and understanding what is going on - but I\'ve still failed to get a good understanding of why th
First of all the code you are writing is java which will create java byte code which translates to different atomic operations on different instruction sets (Arm vs powerpc vs X86...) which can behave different on implementations by different vendors and even between architectures from the same vendor (such as intel core 2 duo and skylake). So its really hard to answer your question!
This paper states that for the tested X86 architectures one execution of any atomic operation perform similarly (very small differnce between CAS, Fetch and add, swap) while CAS may fail and need to be executed multiple times. In case of one thread it will never fail however.
This stackoverflow post states:
Each object has a monitor associated with it. The thread that executes monitorenter gains ownership of the monitor associated with objectref. If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked, then tries again to gain ownership. If the current thread already owns the monitor associated with objectref, it increments a counter in the monitor indicating the number of times this thread has entered the monitor. If the monitor associated with objectref is not owned by any thread, the current thread becomes the owner of the monitor, setting the entry count of this monitor to 1.
Lets look at the necessary operations in case of CAS:
public final int updateAndGet(IntUnaryOperator updateFunction) {
int prev, next;
do {
prev = get();
next = updateFunction.applyAsInt(prev);
} while (!compareAndSet(prev, next));
return next;
}
Fetch x, multiply x, Cas on x, check if cas succeeded
Now this is efficient in a non contented case because there is a minimal number of operations needed. But in case the cacheline is contended it is not very efficient because all threads are actively spinning while most of them fail. Furthermore I remember that spinning on a contended cacheline with an atomic operation is very expensive.
Now the important part in synchronize is:
If another thread already owns the monitor associated with objectref, the current thread waits until the object is unlocked
It depends on how this waiting is implemented.
the synchronized method could put the thread to sleep for a random time after failing to acquire the monitor, Also instead of using an atomic operation to check if the monitor is free it can do it with a simple read (this is quicker but I couldn't find a link to prove it).
My bet is that the waiting in synchronized is implemented in a smart way and optimized for situations with contention with one of the methods above or something similar and therefore it is quicker in the contended scenario.
The tradeof is that in non contended situations it is slower.
I still admit I have no proof.