Observe the following program written in Java (complete runnable version follows, but the important part of the program is in the snippet a little bit further below):
Let's try to get the JVM to behave a bit more "consistently." The JIT compiler is really throwing off comparisons between test runs, so let's disable it with -Djava.compiler=NONE. This definitely introduces a performance hit, but it helps eliminate the obscuring effects of JIT compiler optimizations.
Garbage collection introduces its own set of complexities. Let's use the serial garbage collector with -XX:+UseSerialGC. Let's also disable explicit garbage collections and turn on some logging to see when garbage collection is performed: -verbose:gc -XX:+DisableExplicitGC. Finally, let's allocate a fixed, sufficiently large heap with -Xmx128m -Xms128m.
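Before trusting the numbers, it can be worth confirming that the flags actually reached the JVM. The following is a minimal sketch using the standard java.lang.management API; the class name JvmFlagCheck and its helper methods are hypothetical and not part of the original experiment.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MultiVolatileJavaExperiment.
public class JvmFlagCheck {

    // Arguments passed to the JVM itself (e.g. -XX:+UseSerialGC), not to main().
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Names of the garbage collectors that the running JVM reports.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: " + inputArguments());
        System.out.println("Collectors:    " + collectorNames());

        // getCompilationMXBean() returns null if the JVM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        System.out.println("JIT compiler:  " + (jit == null ? "none reported" : jit.getName()));
    }
}
```

Run it with the same options as the experiment and compare the printed argument list against what you intended to pass. The exact collector and compiler names vary between JVM versions, so treat them as informational only.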
Now we can run the test using:
java -XX:+UseSerialGC -verbose:gc -XX:+DisableExplicitGC -Djava.compiler=NONE -Xmx128m -Xms128m -server -Dsize=50000000 -Dpar=1 MultiVolatileJavaExperiment 10
Running the test multiple times shows the results are very consistent (I'm using Oracle Java 1.6.0_24-b07 on Ubuntu 10.04.3 LTS with an Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz), averaging about 2050 milliseconds. If I comment out the bar = vfoo line, I consistently average about 1280 milliseconds. Running the test with -Dpar=2 averages about 1350 milliseconds with bar = vfoo and about 1005 milliseconds with it commented out.
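+=========+======================+=========================+
| Threads | With bar = vfoo (ms) | Without bar = vfoo (ms) |
+=========+======================+=========================+
| 1       | 2050                 | 1280                    |
+---------+----------------------+-------------------------+
| 2       | 1350                 | 1005                    |
+=========+======================+=========================+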
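To put the table in perspective, the averages can be turned into speedup factors, where perfect scaling from one thread to two would be 2x. This is only a small arithmetic sketch; the class and method names are made up for illustration, and the inputs are the averages quoted above.

```java
// Hypothetical helper for reading the table; not part of the experiment.
public class SpeedupSketch {

    // Speedup of the two-thread run relative to the single-thread run.
    static double speedup(double singleThreadMs, double twoThreadMs) {
        return singleThreadMs / twoThreadMs;
    }

    public static void main(String[] args) {
        // Averages taken from the table above, in milliseconds.
        System.out.println("with bar = vfoo:    " + speedup(2050, 1350));
        System.out.println("without bar = vfoo: " + speedup(1280, 1005));
    }
}
```

Both factors come out at roughly 1.5x and 1.3x respectively, well short of 2x, which is why the loop body deserves a closer look.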
Let's now look at the code and see if we can spot any reasons why multi-threading is inefficient. In Reader.run(), qualifying variables with this where appropriate helps make it clear which variables are local:
int i = 0;
while (i < this.sz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}
The first thing to notice is that the while loop contains four field references through this (this.sz, this.vfoo twice, and this.bar). This means the code is going through the class's runtime constant pool and performing type-checking (via the getfield and putfield bytecode instructions) instead of using plain local variables. Let's change the code to try to eliminate accessing the runtime constant pool and see if we get any benefits.
final int mysz = this.sz;
int i = 0;
while (i < mysz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}
Here, we're using a local mysz variable for the loop size and only accessing sz through this once, for initialization. Running the test with two threads averages about 1295 milliseconds; a small benefit, but a benefit nonetheless.
Looking at the while loop, do we really need to reference this.vfoo twice? The two volatile reads create two synchronization edges that the virtual machine (and the underlying hardware, for that matter) needs to manage. If we want one synchronization edge at the top of each loop iteration and don't need a second one, we can use the following:
final int mysz = this.sz;
Foo myvfoo = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    this.bar = myvfoo;
    i++;
}
This averages about 1122 milliseconds; still getting better. What about that this.bar reference? Since we are talking about multi-threading, let's say the calculations in the while loop are what we want to get the multi-threaded benefit from, and this.bar is how we communicate our results to others. We really don't want to set this.bar until after the while loop is done.
final int mysz = this.sz;
Foo myvfoo = null;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
this.bar = mybar;
This gives us about 857 milliseconds on average. There's still that final this.vfoo reference in the while loop. Assuming again that the while loop is what we want the multi-threaded benefit from, let's move that this.vfoo out of the while loop.
final int mysz = this.sz;
final Foo myvfoo = this.vfoo;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
final Foo vfoocheck = this.vfoo;
if (vfoocheck != myvfoo) {
    System.out.println("vfoo changed from " + myvfoo + " to " + vfoocheck);
}
this.bar = mybar;
Now we average about 502 milliseconds with two threads; the single-threaded test averages about 900 milliseconds.
So what does this tell us? Hoisting non-local variable references out of the while loop has produced significant performance benefits in both the single- and two-threaded tests. The original version of MultiVolatileJavaExperiment was measuring the cost of accessing non-local variables 50,000,000 times, while the final version is measuring the cost of accessing local variables 50,000,000 times. By using local variables, you increase the likelihood that the Java Virtual Machine and the underlying hardware can manage each thread's cached data more efficiently.
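To make the comparison concrete, here is a self-contained sketch that puts the original loop and the final loop side by side. It is not the original MultiVolatileJavaExperiment: the Foo and Reader classes below are simplified stand-ins, the volatile modifier on vfoo is an assumption based on the field's name and the discussion of volatile reads, and the timing in main is only illustrative.

```java
// Hypothetical, simplified stand-in for the experiment; names and layout are assumptions.
public class LoopVariantsSketch {

    // Stand-in for the experiment's Foo class.
    static final class Foo {
        int x = 0;
    }

    static final class Reader {
        final int sz;
        volatile Foo vfoo = new Foo(); // assumed volatile
        Foo bar = null;

        Reader(int sz) {
            this.sz = sz;
        }

        // Original loop: every iteration touches fields through this.
        void runOriginal() {
            int i = 0;
            while (i < this.sz) {
                this.vfoo.x = 1;
                this.bar = this.vfoo;
                i++;
            }
        }

        // Final loop: fields are read once before the loop and written once after it.
        void runHoisted() {
            final int mysz = this.sz;
            final Foo myvfoo = this.vfoo;
            Foo mybar = null;
            int i = 0;
            while (i < mysz) {
                myvfoo.x = 1;
                mybar = myvfoo;
                i++;
            }
            // Detect whether another thread replaced vfoo while the loop was running.
            final Foo vfoocheck = this.vfoo;
            if (vfoocheck != myvfoo) {
                System.out.println("vfoo changed from " + myvfoo + " to " + vfoocheck);
            }
            this.bar = mybar;
        }
    }

    public static void main(String[] args) {
        int size = Integer.getInteger("size", 50000000);

        Reader original = new Reader(size);
        long start = System.nanoTime();
        original.runOriginal();
        System.out.println("original: " + (System.nanoTime() - start) / 1000000L + " ms");

        Reader hoisted = new Reader(size);
        start = System.nanoTime();
        hoisted.runHoisted();
        System.out.println("hoisted:  " + (System.nanoTime() - start) / 1000000L + " ms");
    }
}
```

Both variants leave bar pointing at the same Foo with x set to 1 (for a non-zero loop size), so the observable result is the same; only the number of field accesses inside the loop differs. The sketch runs on a single thread, so it illustrates the loop shapes rather than reproducing the two-thread measurements.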
Finally, let's run the tests normally, with the JIT compiler and default garbage collector back in play (note the loop size of 500,000,000 instead of 50,000,000):
java -Xmx128m -Xms128m -server -Dsize=500000000 -Dpar=2 MultiVolatileJavaExperiment 10
The original version averages about 1100 milliseconds and the modified version averages about 10 milliseconds.