Is it true that having lots of small methods helps the JIT compiler optimize?

Backend · unresolved · 4 answers · 415 views
Asked by 夕颜 on 2021-01-30 16:56

In a recent discussion about how to optimize some code, I was told that breaking code up into lots of small methods can significantly increase performance, because the JIT compiler is better able to optimize small methods. Is this true?

4 Answers
  • 2021-01-30 17:08

    I don't fully understand the mechanics, but based on the link AurA provided, I would guess that the JIT compiler has less bytecode to compile when the same methods are reused, rather than compiling similar-but-duplicated bytecode spread across many methods.

    Beyond that, the more you break your code into meaningful pieces, the more reuse you get out of it, and that gives the VM running it more structure to optimize (you are providing more schema to work with).

    However, I doubt it will have any positive impact if you split your code up arbitrarily, in a way that produces no reuse.

  • 2021-01-30 17:09

    The HotSpot JIT only inlines methods that are smaller than a certain (configurable) bytecode size. So using smaller methods allows more inlining, which is good.

    See the various inlining options on this page.
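
    For instance, you can dump the effective inlining thresholds on your own machine. The flag names below are real HotSpot flags; the default values in the comments are approximate and vary by JVM version and platform, so check them yourself:

    ```shell
    # Print all inlining-related HotSpot flags with their effective values:
    java -XX:+PrintFlagsFinal -version | grep -i inline

    # The two most relevant thresholds (defaults from recent HotSpot builds;
    # they may differ on your JVM/platform):
    #   MaxInlineSize   - max bytecode size of a cold method to inline (~35 bytes)
    #   FreqInlineSize  - max bytecode size of a hot method to inline (~325 bytes)
    ```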


    EDIT

    To elaborate a little:

    • if a method is small, it will get inlined, so there is little risk of being penalised for splitting the code into small methods.
    • in some cases, splitting a method into smaller ones may actually result in more inlining.

    Example (full code to have the same line numbers if you try it)

    package javaapplication27;
    
    public class TestInline {
        private int count = 0;
    
        public static void main(String[] args) throws Exception {
            TestInline t = new TestInline();
            int sum = 0;
            for (int i  = 0; i < 1000000; i++) {
                sum += t.m();
            }
            System.out.println(sum);
        }
    
        public int m() {
            int i = count;
            if (i % 10 == 0) {
                i += 1;
            } else if (i % 10 == 1) {
                i += 2;
            } else if (i % 10 == 2) {
                i += 3;
            }
            i += count;
            i *= count;
            i++;
            return i;
        }
    }
    

    When running this code with the following JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:FreqInlineSize=50 -XX:MaxInlineSize=50 -XX:+PrintInlining (yes, I have deliberately chosen values that prove my point: the original m is too big to be inlined, but the refactored m and m2 are both below the threshold; with other values you might get different output).

    You will see that m() and main() get compiled, but m() does not get inlined:

     56    1             javaapplication27.TestInline::m (62 bytes)
     57    1 %           javaapplication27.TestInline::main @ 12 (53 bytes)
              @ 20   javaapplication27.TestInline::m (62 bytes)   too big
    

    You can also inspect the generated assembly to confirm that m is not inlined (I used these JVM flags: -XX:+PrintAssembly -XX:PrintAssemblyOptions=intel) - it will look like this:

    0x0000000002780624: int3   ;*invokevirtual m
                               ; - javaapplication27.TestInline::main@20 (line 10)
    

    If you refactor the code like this (I have extracted the if/else in a separate method):

    public int m() {
        int i = count;
        i = m2(i);
        i += count;
        i *= count;
        i++;
        return i;
    }
    
    public int m2(int i) {
        if (i % 10 == 0) {
            i += 1;
        } else if (i % 10 == 1) {
            i += 2;
        } else if (i % 10 == 2) {
            i += 3;
        }
        return i;
    }
    

    You will see the following compilation actions:

     60    1             javaapplication27.TestInline::m (30 bytes)
     60    2             javaapplication27.TestInline::m2 (40 bytes)
                @ 7   javaapplication27.TestInline::m2 (40 bytes)   inline (hot)
     63    1 %           javaapplication27.TestInline::main @ 12 (53 bytes)
                @ 20   javaapplication27.TestInline::m (30 bytes)   inline (hot)
                @ 7   javaapplication27.TestInline::m2 (40 bytes)   inline (hot)
    

    So m2 gets inlined into m, which you would expect, so we are back to the original scenario. But when main gets compiled, it actually inlines the whole thing. At the assembly level, this means you won't find any invokevirtual instructions any more. Instead, you will find lines like this:

     0x00000000026d0121: add    ecx,edi   ;*iinc
                                          ; - javaapplication27.TestInline::m2@7 (line 33)
                                          ; - javaapplication27.TestInline::m@7 (line 24)
                                          ; - javaapplication27.TestInline::main@20 (line 10)
    

    where, basically, common instructions are shared between the inlined copies ("mutualised").

    Conclusion

    I am not saying that this example is representative, but it does demonstrate a few points:

    • using smaller methods improves the readability of your code
    • smaller methods generally get inlined, so you will most likely not pay the cost of the extra method call (it is performance neutral)
    • using smaller methods can improve inlining globally in some circumstances, as shown by the example above

    And finally: if a portion of your code is so performance-critical that these considerations matter, you should examine the JIT output to fine-tune your code, and above all profile before and after.

  • 2021-01-30 17:18

    If you take the exact same code and just break it up into lots of small methods, that is not going to help the JIT at all.

    A better way to put it is that modern HotSpot JVMs do not penalize you for writing lots of small methods. They get aggressively inlined, so at runtime you do not really pay the cost of the method calls. This is true even for virtual dispatch, including invokeinterface call sites that invoke an interface method.

    I wrote a blog post several years ago that describes how you can watch the JVM inlining methods. The technique is still applicable to modern JVMs. I also found it useful to look at the discussions around invokedynamic, where how modern HotSpot JVMs compile Java bytecode is discussed extensively.
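
    To illustrate the point about virtual calls, here is a hypothetical sketch (not from the blog post; all names are mine). The interface call site only ever sees one receiver type, so HotSpot can devirtualize and inline it. Run it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining to watch the inlining decision:

    ```java
    public class MonomorphicDemo {
        interface Op {
            int apply(int x);
        }

        static final class AddOne implements Op {
            public int apply(int x) {
                return x + 1;
            }
        }

        public static void main(String[] args) {
            Op op = new AddOne();  // the only concrete type the call site ever observes
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                // An interface (invokeinterface) call, but since the call site is
                // monomorphic, HotSpot can devirtualize and inline apply().
                sum += op.apply(i);
            }
            System.out.println(sum);  // 1 + 2 + ... + 1000000
        }
    }
    ```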

  • 2021-01-30 17:28

    I've read numerous articles which have stated that smaller methods (as measured in the number of bytes required to represent the method as Java bytecode) are more likely to be eligible for inlining by the JIT (just-in-time compiler) when it compiles hot methods (those which are being run most frequently) into machine code. And they describe how method inlining produces better performance of the resulting machine code. In short: smaller methods give the JIT more options in terms of how to compile bytecode into machine code when it identifies a hot method, and this allows more sophisticated optimizations.

    To test this theory, I created a JMH class with two benchmark methods, each containing identical behaviour but factored differently. The first benchmark is named monolithicMethod (all code in a single method), and the second benchmark is named smallFocusedMethods and has been refactored so that each major behaviour has been moved out into its own method. The smallFocusedMethods benchmark looks like this:

    @Benchmark
    public void smallFocusedMethods(TestState state) {
        int i = state.value;
        if (i < 90) {
            actionOne(i, state);
        } else {
            actionTwo(i, state);
        }
    }
    
    private void actionOne(int i, TestState state) {
        state.sb.append(Integer.toString(i)).append(
                ": has triggered the first type of action.");
        int result = i;
        for (int j = 0; j < i; ++j) {
            result += j;
        }
        state.sb.append("Calculation gives result ").append(Integer.toString(
                result));
    }
    
    private void actionTwo(int i, TestState state) {
        state.sb.append(i).append(" has triggered the second type of action.");
        int result = i;
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                result *= k * j + i;
            }
        }
        state.sb.append("Calculation gives result ").append(Integer.toString(
                result));
    }
    

    and you can imagine how monolithicMethod looks (same code but entirely contained within the one method). The TestState simply does the work of creating a new StringBuilder (so that the creation of this object is not counted in the benchmark time) and of choosing a random number between 0 and 100 for each invocation (and this has been deliberately configured so that both benchmarks use exactly the same sequence of random numbers, to avoid the risk of bias).
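
    The answer does not show TestState itself; the following is only a plausible reconstruction consistent with the description above (the class and field names come from the snippets, everything else is an assumption). The JMH annotations it would need are noted in comments so the sketch compiles standalone:

    ```java
    import java.util.Random;

    // Hypothetical reconstruction. In the real benchmark this would be a JMH
    // @State(Scope.Thread) class, with setUp() annotated @Setup(Level.Invocation)
    // so that it runs before every benchmark invocation, outside the timed region.
    public class TestState {
        StringBuilder sb;
        int value;

        // Fixed seed so both benchmarks see exactly the same random sequence,
        // matching the "deliberately configured" determinism described above.
        private final Random random = new Random(42);

        public void setUp() {
            sb = new StringBuilder();     // allocation not counted in benchmark time
            value = random.nextInt(100);  // 0..99; values below 90 (~90%) take the first branch
        }

        public static void main(String[] args) {
            TestState state = new TestState();
            state.setUp();
            System.out.println(state.value >= 0 && state.value < 100);
        }
    }
    ```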

    After running the benchmark with six "forks", each involving five warmups of one second, followed by six iterations of five seconds, the results look like this:

    Benchmark                                         Mode   Cnt        Score        Error   Units
    
    monolithicMethod                                  thrpt   30  7609784.687 ± 118863.736   ops/s
    monolithicMethod:·gc.alloc.rate                   thrpt   30     1368.296 ±     15.834  MB/sec
    monolithicMethod:·gc.alloc.rate.norm              thrpt   30      270.328 ±      0.016    B/op
    monolithicMethod:·gc.churn.G1_Eden_Space          thrpt   30     1357.303 ±     16.951  MB/sec
    monolithicMethod:·gc.churn.G1_Eden_Space.norm     thrpt   30      268.156 ±      1.264    B/op
    monolithicMethod:·gc.churn.G1_Old_Gen             thrpt   30        0.186 ±      0.001  MB/sec
    monolithicMethod:·gc.churn.G1_Old_Gen.norm        thrpt   30        0.037 ±      0.001    B/op
    monolithicMethod:·gc.count                        thrpt   30     2123.000               counts
    monolithicMethod:·gc.time                         thrpt   30     1060.000                   ms
    
    smallFocusedMethods                               thrpt   30  7855677.144 ±  48987.206   ops/s
    smallFocusedMethods:·gc.alloc.rate                thrpt   30     1404.228 ±      8.831  MB/sec
    smallFocusedMethods:·gc.alloc.rate.norm           thrpt   30      270.320 ±      0.001    B/op
    smallFocusedMethods:·gc.churn.G1_Eden_Space       thrpt   30     1393.473 ±     10.493  MB/sec
    smallFocusedMethods:·gc.churn.G1_Eden_Space.norm  thrpt   30      268.250 ±      1.193    B/op
    smallFocusedMethods:·gc.churn.G1_Old_Gen          thrpt   30        0.186 ±      0.001  MB/sec
    smallFocusedMethods:·gc.churn.G1_Old_Gen.norm     thrpt   30        0.036 ±      0.001    B/op
    smallFocusedMethods:·gc.count                     thrpt   30     1986.000               counts
    smallFocusedMethods:·gc.time                      thrpt   30     1011.000                   ms
    

    In short, these numbers show that the smallFocusedMethods approach ran 3.2% faster, and the difference was statistically significant (with 99.9% confidence). And note that the memory usage (based on garbage collection profiling) was not significantly different. So you get faster performance without increased overhead.

    I've run a variety of similar benchmarks to test whether small, focused methods give better throughput, and I've found that the improvement is between 3% and 7% in all cases I've tried. But the actual gain likely depends strongly on the version of the JVM being used, on the distribution of executions across your if/else blocks (I chose 90% on the first and 10% on the second to exaggerate the heat on the first "action", but I've seen throughput improvements even with a more equal spread across a chain of if/else blocks), and on the actual complexity of the work done by each of the possible actions. So be sure to write your own specific benchmarks if you need to determine what works for your specific application.

    My advice is this: write small, focused methods because it makes the code tidier, easier to read, and much easier to override specific behaviours when inheritance is involved. The fact that the JIT is likely to reward you with slightly better performance is a bonus, but tidy code should be your main goal in the majority of cases. Oh, and it's also important to give each method a clear, descriptive name which exactly summarises the responsibility of the method (unlike the terrible names I've used in my benchmark).
