I understand that Java 8 has introduced new synchronization tools such as LongAccumulator (under the atomic package).
In the documentation it says that LongAccumulator is usually preferable to AtomicLong when a common value is updated by multiple threads. Judging by this source:
http://codenav.org/code.html?project=/jdk/1.8.0-ea&path=/Source%20Packages/java.util.concurrent.atomic/LongAccumulator.java
it looks like a spin lock.
That's a very good question, because it shows a very important characteristic of concurrent programming with shared memory. Before going into details, I have to take a step back. Take a look at the following class:
import java.util.concurrent.atomic.AtomicLong;

class Accumulator {
    private final AtomicLong value = new AtomicLong(0);

    public void accumulate(long value) {
        this.value.addAndGet(value);
    }

    public long get() {
        return this.value.get();
    }
}
If you create one instance of this class and invoke the method accumulate(1) from one thread in a loop, the execution will be really fast. However, if you invoke the method on the same instance from two threads, the execution will be about two orders of magnitude slower.
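You can observe this yourself with a rough timing harness like the one below. It is only a sketch, not a proper benchmark (the thread and iteration counts are arbitrary choices, and a serious measurement would use something like JMH), but the slowdown is usually large enough to show up anyway:

import java.util.ArrayList;
import java.util.List;

class ContentionDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int threadCount : new int[] {1, 2}) {
            Accumulator accumulator = new Accumulator();
            List<Thread> threads = new ArrayList<>();
            long start = System.nanoTime();
            for (int i = 0; i < threadCount; i++) {
                // Each thread hammers the same AtomicLong.
                Thread thread = new Thread(() -> {
                    for (int j = 0; j < 10_000_000; j++) {
                        accumulator.accumulate(1);
                    }
                });
                thread.start();
                threads.add(thread);
            }
            for (Thread thread : threads) {
                thread.join();
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threadCount + " thread(s): " + millis + " ms");
        }
    }
}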
You have to take a look at the memory architecture to understand what happens. Most systems nowadays have non-uniform memory access. In particular, each core has its own L1 cache, which is typically organized into cache lines of 64 bytes. If a core executes an atomic increment operation on a memory location, it first has to get exclusive access to the corresponding cache line. That's expensive if it does not already have exclusive access, due to the required coordination with all other cores.
There's a simple and counter-intuitive trick to solve this problem. Take a look at the following class:
import java.util.concurrent.atomic.AtomicLong;

class Accumulator {
    private final AtomicLong[] values = {
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
        new AtomicLong(0),
    };

    public void accumulate(long value) {
        int index = getMagicValue();
        this.values[index % values.length].addAndGet(value);
    }

    public long get() {
        long result = 0;
        for (AtomicLong value : values) {
            result += value.get();
        }
        return result;
    }

    // Placeholder: any cheap, reasonably thread-affine value will do,
    // for example a hash of the current thread (masked to stay non-negative).
    private int getMagicValue() {
        return Thread.currentThread().hashCode() & 0x7fffffff;
    }
}
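Used from several threads, it might look like this (the thread count of four is an arbitrary choice):

class StripedDemo {
    public static void main(String[] args) throws InterruptedException {
        Accumulator accumulator = new Accumulator();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            // Threads with different magic values hit different counters.
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    accumulator.accumulate(1);
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(accumulator.get()); // prints 4000000
    }
}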
At first glance, this class seems more expensive due to the additional operations. However, it might be several times faster than the first class, because it increases the probability that the executing core already has exclusive access to the required cache line.
To make this really fast, you have to consider a few more things. Most importantly, the different counters should be located on different cache lines, otherwise you replace one problem with another, namely false sharing. In Java, you can for example use a long[8 * 4] for that purpose and only use the indexes 0, 8, 16 and 24: with 8-byte longs and 64-byte cache lines, each used slot then lands on its own cache line, and the slots in between act as padding.
To sum up, LongAccumulator is more efficient for some use cases, because it uses redundant memory for frequently used write operations, in order to reduce the number of times that cache lines have to be exchanged between cores. On the other hand, read operations are slightly more expensive, because they have to create a consistent result.
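For completeness, the class itself is used like this:

import java.util.concurrent.atomic.LongAccumulator;

class AccumulatorDemo {
    public static void main(String[] args) {
        // Long::sum makes this behave like an addition-based counter.
        LongAccumulator accumulator = new LongAccumulator(Long::sum, 0);
        accumulator.accumulate(42);
        System.out.println(accumulator.get()); // prints 42
    }
}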