I believe you need to reduce your code so it's not doing lots of incidental things that could be confusing matters. After reducing the code, it is clear to me that you are only accessing the same array location every time, i.e. position 512.
If you minimise your code and reuse your threads, so you are not stopping/starting them, you get much more reproducible results.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiStackJavaExperiment {
    static final int size = Integer.getInteger("size", 500000000);

    public static void main(String... args) throws ExecutionException, InterruptedException {
        int par = 8;
        // repeat the experiment for array lengths from 64 up to 64K
        for (int s = 64; s <= 64 * 1024; s *= 2) {
            int times = args.length == 0 ? 1 : Integer.parseInt(args[0]);
            long[] measurements = new long[times];
            ExecutorService es = Executors.newFixedThreadPool(par);
            List<Future<?>> futures = new ArrayList<Future<?>>(times);
            for (int i = 0; i < times; i++) {
                long start = System.currentTimeMillis();
                final int sz = size / par;
                futures.clear();
                for (int j = 0; j < par; j++) {
                    // each task gets its own array, allocated by the submitting thread
                    final Object[] arr = new Object[s];
                    futures.add(es.submit(new Runnable() {
                        @Override
                        public void run() {
                            final int bits = 7, arraySize = 1 << bits; // unused here, leftover from the original code
                            // each worker alternately writes and reads the same two
                            // adjacent slots of its own array
                            int i = 0;
                            int pos = 32;
                            Object v = new Object();
                            while (i < sz) {
                                if (i % 2 == 0) {
                                    arr[pos] = v;
                                    pos += 1;
                                } else {
                                    pos -= 1;
                                    v = arr[pos];
                                }
                                i++;
                            }
                        }
                    }));
                }
                for (Future<?> future : futures)
                    future.get();
                long time = System.currentTimeMillis() - start;
                // System.out.println(i + ") Running time: " + time + " ms");
                measurements[i] = time;
            }
            es.shutdown();
            System.out.println("par = " + par + " arr.length= " + s
                    + " >>> All running times: " + Arrays.toString(measurements));
        }
    }
}
This shows that the distance between the accessed values matters. By allocating an array in each thread, you use different TLABs (thread-local allocation buffers), which space out the data in blocks.
par = 8 arr.length= 64 >>> All running times: [539, 413, 444, 444, 457, 444, 456]
par = 8 arr.length= 256 >>> All running times: [398, 527, 514, 529, 445, 441, 445]
par = 8 arr.length= 1024 >>> All running times: [419, 507, 477, 422, 412, 452, 396]
par = 8 arr.length= 4096 >>> All running times: [316, 282, 250, 232, 242, 229, 238]
par = 8 arr.length= 16384 >>> All running times: [316, 207, 209, 212, 208, 208, 208]
par = 8 arr.length= 65536 >>> All running times: [211, 211, 208, 208, 208, 291, 206]
par = 8 arr.length= 262144 >>> All running times: [366, 210, 210, 210, 210, 209, 211]
par = 8 arr.length= 1048576 >>> All running times: [296, 211, 215, 216, 213, 211, 211]
If you move the array allocation inside the thread, you get the following (a minimal sketch of this variant appears after the results):
par = 8 arr.length= 64 >>> All running times: [225, 151, 151, 150, 152, 153, 152]
par = 8 arr.length= 256 >>> All running times: [155, 151, 151, 151, 151, 151, 155]
par = 8 arr.length= 1024 >>> All running times: [153, 152, 151, 151, 151, 155, 152]
par = 8 arr.length= 4096 >>> All running times: [155, 156, 151, 152, 151, 155, 155]
par = 8 arr.length= 16384 >>> All running times: [154, 157, 152, 152, 158, 153, 153]
par = 8 arr.length= 65536 >>> All running times: [155, 157, 152, 184, 181, 154, 153]
par = 8 arr.length= 262144 >>> All running times: [240, 159, 166, 151, 172, 154, 160]
par = 8 arr.length= 1048576 >>> All running times: [165, 162, 163, 162, 163, 162, 163]
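For reference, the only change in that run is where the array is allocated; here is a minimal sketch of the modified task, assuming a final copy of s so the anonymous class can capture it (the rest of the benchmark is unchanged):

final int len = s; // final copy of the loop variable so the anonymous class can capture it
futures.add(es.submit(new Runnable() {
    @Override
    public void run() {
        // allocated by the worker thread itself, so each array comes from that
        // thread's own TLAB and the arrays are spaced apart on the heap
        final Object[] arr = new Object[len];
        int i = 0;
        int pos = 32;
        Object v = new Object();
        while (i < sz) {
            if (i % 2 == 0) {
                arr[pos] = v;
                pos += 1;
            } else {
                pos -= 1;
                v = arr[pos];
            }
            i++;
        }
    }
}));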
Turn off the TLAB with -XX:-UseTLAB and the same code gives you
par = 8 arr.length= 64 >>> All running times: [608, 467, 467, 457, 468, 461, 428]
par = 8 arr.length= 256 >>> All running times: [437, 437, 522, 512, 522, 369, 535]
par = 8 arr.length= 1024 >>> All running times: [394, 395, 475, 525, 470, 440, 478]
par = 8 arr.length= 4096 >>> All running times: [347, 215, 238, 226, 236, 204, 271]
par = 8 arr.length= 16384 >>> All running times: [291, 157, 178, 151, 150, 151, 152]
par = 8 arr.length= 65536 >>> All running times: [163, 152, 162, 151, 159, 159, 154]
par = 8 arr.length= 262144 >>> All running times: [164, 172, 152, 169, 160, 161, 160]
par = 8 arr.length= 1048576 >>> All running times: [295, 153, 164, 153, 166, 154, 163]
Solution
Run the JVM with the -XX:+UseCondCardMark flag, available only in JDK7. This solves the problem.
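For example, an illustrative invocation of the benchmark above (the size property and the times argument are the ones the code reads):

java -Dsize=500000000 -XX:+UseCondCardMark MultiStackJavaExperiment 7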
Explanation
Essentially, most managed-heap environments use card tables to mark the areas of memory into which writes have occurred. Such memory areas are marked as dirty in the card table once the write occurs. This information is needed for garbage collection: references in non-dirty memory areas don't have to be scanned. A card is a contiguous block of memory, typically 512 bytes. A card table typically has 1 byte per card; if this byte is set, the card is dirty. This means that 64 bytes of card table cover 64 * 512 = 32768 bytes of memory. And typically, the cache line size today is 64 bytes.
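As a rough sketch of that arithmetic (the constants are the typical values quoted above, not taken from any particular JVM):

class CardTableArithmetic {
    static final int CARD_SIZE = 512;   // bytes of heap covered by one card
    static final int CACHE_LINE = 64;   // bytes of card table in one cache line

    // one card-table byte per card, so one cache line of card-table bytes
    // covers 64 * 512 = 32768 bytes (32 KB) of heap
    static final int HEAP_PER_CARD_LINE = CACHE_LINE * CARD_SIZE;

    // card index for a heap address: address / 512, i.e. address >> 9
    static long cardIndex(long address) {
        return address >>> 9;
    }
}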
So each time a write to an object field occurs, the byte for the corresponding card in the card table must be set as dirty. A useful optimization in single-threaded programs is to do this by simply marking the relevant byte, doing a write each time. The alternative of first checking whether the byte is set and doing a conditional write requires an additional read and a conditional jump, which is slightly slower.
However, this optimization can be catastrophic when multiple processors are writing to memory, as writes to neighbouring cards require writes to neighbouring bytes in the card table. So the memory areas being written to (the array entries above) are not in the same cache line, which is the usual cause of memory contention. The real reason is that the dirty bytes being written are in the same cache line.
What the above flag does is implement the card-table dirty-byte write by first checking whether the byte is already set, and setting it only if it isn't. This way the memory contention happens only during the first write to that card; after that, only reads of that cache line occur. Since the cache line is only read, it can be replicated across multiple processors and they don't have to synchronize to read it.
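In rough write-barrier pseudocode (a sketch only; the real barriers are emitted by the JIT compiler and the actual dirty value is JVM-internal), the two strategies look like this:

class CardMarkSketch {
    static final byte DIRTY = 1;                       // "set" means dirty, as described above
    static final byte[] cardTable = new byte[1 << 20]; // one byte per 512-byte card

    // default barrier: always store, so threads dirtying nearby cards keep
    // writing to the same card-table cache line
    static void unconditionalMark(long address) {
        cardTable[(int) (address >>> 9)] = DIRTY;
    }

    // -XX:+UseCondCardMark: read first, store only if needed, so after the first
    // write the card-table cache line is only read and can be shared
    static void conditionalMark(long address) {
        int idx = (int) (address >>> 9);
        if (cardTable[idx] != DIRTY)
            cardTable[idx] = DIRTY;
    }
}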
I've observed that this flag increases the running time some 15-20% in the 1-thread case.
The -XX:+UseCondCardMark flag is explained in this blog post and this bug report.
The relevant concurrency mailing list discussion: Array allocation and access on the JVM.