I\'ve had this question for quite a while now, trying to read lots of resources and understanding what is going on - but I\'ve still failed to get a good understanding of why th
Note that CAS can give you more fine-grained ordering (non-)guarantees than a synchronized block, especially with java-9 varhandles which provide ordering options aligned with the C++11 memory model.
If all you want to do is some statistics-keeping from multiple threads then a read-compute-update loop with the most relaxed memory orderings available (plain read; plain and weak CAS) may perform better on weakly ordered platforms since it won't need any barriers and the cas won't have to do wasteful internal looping if it's implemented on top of LL/SC. Additionally it will also give the JITs more freedom to reorder the instructions around those atomics. compareAndExchange
can eliminate an additional read on loop repetition.
Another complication is also how you measure performance. All of the implementations should have progress-guarantees, i.e. even under contention at least one can finish at a time. So in principle you could be wasting CPU cycles on multiple threads attempting to update your variable concurrently but still be better on the measure of 99th percentile latency because atomic operations won't resort to descheduling the thread and worse on worst-case latency because they're not fair. So just measuring averages might not tell the whole story here.