Why is reading a volatile and writing to a field member not scalable in Java?

遇见更好的自我 2021-01-30 22:04

Observe the following program written in Java (complete runnable version follows, but the important part of the program is in the snippet a little bit further below):

         


        
5 Answers
  •  无人及你
    2021-01-30 22:12

    Let's try to get the JVM to behave a bit more "consistently." The JIT compiler is really throwing off comparisons between test runs, so let's disable it by using -Djava.compiler=NONE. This definitely introduces a performance hit, but it removes JIT compiler optimizations as a source of variation between runs.

    Garbage collection introduces its own set of complexities. Let's use the serial garbage collector by using -XX:+UseSerialGC. Let's also disable explicit garbage collections and turn on some logging to see when garbage collection is performed: -verbose:gc -XX:+DisableExplicitGC. Finally, let's get enough heap allocated using -Xmx128m -Xms128m.

    Now we can run the test using:

    java -XX:+UseSerialGC -verbose:gc -XX:+DisableExplicitGC -Djava.compiler=NONE -Xmx128m -Xms128m -server -Dsize=50000000 -Dpar=1 MultiVolatileJavaExperiment 10
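
    Before comparing timings, it can help to confirm that the options above actually reached the JVM. The following is a minimal sketch, not part of the original benchmark: the class name JvmFlagCheck and the helper hasFlag are made up for illustration, and it relies only on the standard java.lang.management API. If -Djava.compiler=NONE appears to have no effect on your JVM, -Xint is HotSpot's documented interpreter-only switch.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper, not part of MultiVolatileJavaExperiment.
final class JvmFlagCheck {

    // True if the exact flag string is among the given JVM arguments.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not the arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        String[] expected = {
            "-XX:+UseSerialGC", "-verbose:gc", "-XX:+DisableExplicitGC",
            "-Xmx128m", "-Xms128m"
        };
        for (String flag : expected) {
            System.out.println(flag + " present: " + hasFlag(jvmArgs, flag));
        }
        // -Djava.compiler=NONE is a system property, so it can be read back directly.
        System.out.println("java.compiler = " + System.getProperty("java.compiler"));
    }
}
```

    Run it with the same options as the benchmark command line and check that every flag is reported as present.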
    

    Running the test multiple times shows the results are very consistent (I'm using Oracle Java 1.6.0_24-b07 on Ubuntu 10.04.3 LTS with an Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz), averaging about 2050 milliseconds. If I comment out the bar = vfoo line, I consistently average about 1280 milliseconds. Running the test using -Dpar=2 results in an average of about 1350 milliseconds with bar = vfoo and about 1005 milliseconds with it commented out.

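    +=========+======================+=========================+
    | Threads | With bar = vfoo (ms) | Without bar = vfoo (ms) |
    +=========+======================+=========================+
    |    1    |         2050         |          1280           |
    +---------+----------------------+-------------------------+
    |    2    |         1350         |          1005           |
    +=========+======================+=========================+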
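
    To put the table in perspective, the speedup from one to two threads can be computed directly from the averages. This is only a sketch of the arithmetic: it assumes -Dpar=2 splits the same total number of iterations across two threads (the benchmark source is not shown here), so ideal scaling would be 2.0x. The class name SpeedupTable is made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic; the timings are the averages from the table.
final class SpeedupTable {

    // Ratio of one-thread time to two-thread time; 2.0 would be ideal scaling.
    static double speedup(double oneThreadMs, double twoThreadMs) {
        return oneThreadMs / twoThreadMs;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "with bar = vfoo:    %.2fx", speedup(2050, 1350)));
        System.out.println(String.format(Locale.ROOT,
                "without bar = vfoo: %.2fx", speedup(1280, 1005)));
    }
}
```

    That works out to roughly 1.52x with bar = vfoo and 1.27x without it, both well short of 2x.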
    

    Let's now look at the code and see if we can spot any reasons why multi-threading is inefficient. In Reader.run(), qualifying fields with this where appropriate helps make it clear which variables are local and which are not:

    int i = 0;
    while (i < this.sz) {
        this.vfoo.x = 1;
        this.bar = this.vfoo;
        i++;
    }
    

    The first thing to notice is that the while loop contains four field accesses through this (sz once, vfoo twice, and bar once). This means the code is accessing the class's runtime constant pool and performing type-checking (via the getfield bytecode instruction). Let's change the code to try to eliminate accessing the runtime constant pool and see if we get any benefit.

    final int mysz = this.sz;
    int i = 0;
    while (i < mysz) {
        this.vfoo.x = 1;
        this.bar = this.vfoo;
        i++;
    }
    

    Here, we're using a local mysz variable to access the loop size and only accessing sz through this once, for initialization. Running the test, with two threads, averages about 1295 milliseconds; a small benefit, but one nonetheless.

    Looking at the while loop, do we really need to reference this.vfoo twice? The two volatile reads create two synchronization edges that the virtual machine (and underlying hardware, for that matter) needs to manage. If we want one synchronization edge at the beginning of each loop iteration and don't need two, we can use the following:

    final int mysz = this.sz;
    Foo myvfoo = null;
    int i = 0;
    while (i < mysz) {
        myvfoo = this.vfoo;
        myvfoo.x = 1;
        this.bar = myvfoo;
        i++;
    }
    

    This averages about 1122 milliseconds; still getting better. What about that this.bar reference? Since we are talking about multi-threading, let's say the calculations in the while loop are what we want to get multi-threaded benefit from, and this.bar is how we communicate our results to others. We really don't want to set this.bar until after the while loop is done.

    final int mysz = this.sz;
    Foo myvfoo = null;
    Foo mybar = null;
    int i = 0;
    while (i < mysz) {
        myvfoo = this.vfoo;
        myvfoo.x = 1;
        mybar = myvfoo;
        i++;
    }
    this.bar = mybar;
    

    Which gives us about 857 milliseconds on average. There's still that final this.vfoo reference in the while loop. Assuming again that the while loop is what we want multi-threaded benefit from, let's move that this.vfoo out of the while loop.

    final int mysz = this.sz;
    final Foo myvfoo = this.vfoo;
    Foo mybar = null;
    int i = 0;
    while (i < mysz) {
        myvfoo.x = 1;
        mybar = myvfoo;
        i++;
    }
    final Foo vfoocheck = this.vfoo;
    if (vfoocheck != myvfoo) {
        System.out.println("vfoo changed from " + myvfoo + " to " + vfoocheck);
    }
    this.bar = mybar;
    

    Now we average about 502 milliseconds with two threads; the single-threaded test averages about 900 milliseconds.

    So what does this tell us? Hoisting non-local variable references out of the while loop has produced significant performance benefits in both the single- and double-threaded tests. The original version of MultiVolatileJavaExperiment was measuring the cost of accessing non-local variables 50,000,000 times, while the final version is measuring the cost of accessing local variables 50,000,000 times. By using local variables, you increase the likelihood that the Java Virtual Machine and underlying hardware can manage the thread caches more efficiently.
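
    For anyone who wants to compare the first and last loop variants side by side, here is a minimal self-contained sketch. It is not the original MultiVolatileJavaExperiment: the class name HoistingSketch, the timeMillis helper, and the hoisted switch are made up for illustration, vfoo is assumed to be a volatile field as its name suggests, and splitting size evenly across par threads is also an assumption.

```java
// Hypothetical stand-in for the benchmark; only the two loop bodies come from the answer.
final class HoistingSketch {

    static final class Foo {
        int x;
    }

    static final class Reader extends Thread {
        volatile Foo vfoo = new Foo(); // assumed volatile, per the field name
        Foo bar;
        final int sz;
        final boolean hoisted;

        Reader(int sz, boolean hoisted) {
            this.sz = sz;
            this.hoisted = hoisted;
        }

        @Override
        public void run() {
            if (hoisted) {
                runHoisted();
            } else {
                runOriginal();
            }
        }

        // Original loop: every iteration touches fields through this.
        void runOriginal() {
            int i = 0;
            while (i < this.sz) {
                this.vfoo.x = 1;
                this.bar = this.vfoo;
                i++;
            }
        }

        // Final loop: fields are read once before the loop, bar is written once after it.
        void runHoisted() {
            final int mysz = this.sz;
            final Foo myvfoo = this.vfoo;
            Foo mybar = null;
            int i = 0;
            while (i < mysz) {
                myvfoo.x = 1;
                mybar = myvfoo;
                i++;
            }
            this.bar = mybar;
        }
    }

    // Wall-clock time for par threads that together perform about size iterations.
    static long timeMillis(int par, int size, boolean hoisted) throws InterruptedException {
        Reader[] readers = new Reader[par];
        for (int t = 0; t < par; t++) {
            readers[t] = new Reader(size / par, hoisted);
        }
        long start = System.nanoTime();
        for (Reader r : readers) {
            r.start();
        }
        for (Reader r : readers) {
            r.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        int size = Integer.getInteger("size", 50_000_000);
        int par = Integer.getInteger("par", 2);
        System.out.println("original: " + timeMillis(par, size, false) + " ms");
        System.out.println("hoisted:  " + timeMillis(par, size, true) + " ms");
    }
}
```

    The -Dsize and -Dpar properties are read the same way as in the command lines above. Absolute numbers will differ from the ones quoted here depending on the JVM, the flags, and the hardware.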

    Finally, let's run the tests normally using (notice, using 500,000,000 loop size instead of 50,000,000):

    java -Xmx128m -Xms128m -server -Dsize=500000000 -Dpar=2 MultiVolatileJavaExperiment 10
    

    The original version averages about 1100 milliseconds and the modified version averages about 10 milliseconds.
