Why is the execution time of this function call changing?

日久生厌 2021-02-03 17:10

Preface

This issue seems to only affect Chrome/V8, and may not be reproducible in Firefox or other browsers. In summary, the execution time of a functi

2 Answers
  • Since this is getting so much interest (and updates to the question), I thought I'd provide some additional detail.

    The new simplified test case is great: it's very simple, and very clearly shows a problem.

    function test(callback) {
      let start = performance.now();
      for (let i = 0; i < 1e6; i++) callback();
      console.log(`${callback.name} took ${(performance.now() - start).toFixed(2)}ms`);
    }
    
    var exampleA = (a,b) => 10**10;
    var exampleB = (a,b) => 10**10;
    
    // one callback -> fast
    for (let i = 0; i < 10; i++) test(exampleA);
    
    // introduce a second callback -> much slower forever
    for (let i = 0; i < 10; i++) test(exampleB);
    for (let i = 0; i < 10; i++) test(exampleA);
    

    On my machine, I'm seeing times go as low as 0.23 ms for exampleA alone; once exampleB comes along, they go up to 7.3 ms and stay there. Wow, a 30x slowdown! Clearly that's a bug in V8? Why wouldn't the team jump on fixing this?

    Well, the situation is more complicated than it seems at first.

    Firstly, the "slow" case is the normal situation. That's what you should expect to see in most code. It's still pretty fast! You can do a million function calls (plus a million exponentiations, plus a million loop iterations) in just 7 milliseconds! That's only 7 nanoseconds per iteration+call+exponentiation+return!

    Actually, that analysis was a bit simplified. In reality, an operation on two constants like 10**10 will be constant-folded at compile time, so once exampleA and exampleB get optimized, the optimized code for them will return 1e10 immediately, without doing any multiplications. On the flip side, the code here contains a small oversight that causes the engine to have to do more work: exampleA and exampleB take two parameters (a, b), but they're called without any arguments simply as callback(). Bridging this difference between expected and actual number of parameters is fast, but on a test like this that doesn't do much else, it amounts to about 40% of the total time spent. So a more accurate statement would be: it takes about 4 nanoseconds to do a loop iteration plus a function call plus a materialization of a number constant plus a function return, or 7 ns if the engine additionally has to adapt the arguments count of the call.
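
    To see the arguments-count effect in isolation, here is a minimal sketch that runs the same loop once with callback() and once with callback(1, 2). The names testMismatched, testMatched and results are made up for this illustration, and the values 1 and 2 are arbitrary. The result of each call is stored in a variable so that the call cannot simply be dropped. Whether you can measure any difference depends on your V8 version and hardware, so treat the printed timings as observations, not as a guarantee.

```javascript
const exampleA = (a, b) => 10 ** 10;
const exampleB = (a, b) => 10 ** 10;

// Calls the callback without arguments, as in the test case above,
// although the callback declares two parameters.
function testMismatched(callback) {
  const start = performance.now();
  let last;
  for (let i = 0; i < 1e6; i++) last = callback();
  return { last, ms: performance.now() - start };
}

// Calls the callback with exactly as many arguments as it declares.
// The values 1 and 2 are arbitrary placeholders.
function testMatched(callback) {
  const start = performance.now();
  let last;
  for (let i = 0; i < 1e6; i++) last = callback(1, 2);
  return { last, ms: performance.now() - start };
}

// Use two different callbacks so the call sites see more than one target.
const results = [];
for (const callback of [exampleA, exampleB, exampleA]) {
  const mismatched = testMismatched(callback);
  const matched = testMatched(callback);
  results.push({ name: callback.name, mismatched, matched });
  console.log(
    `${callback.name}: callback() ${mismatched.ms.toFixed(2)}ms, ` +
    `callback(1, 2) ${matched.ms.toFixed(2)}ms`
  );
}
```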

    So what about the initial results for just exampleA, how can that case be so much faster? Well, that's the lucky situation that hits various optimizations in V8 and can take several shortcuts -- in fact it can take so many shortcuts that it ends up being a misleading microbenchmark: the results it produces don't reflect real situations, and can easily cause an observer to draw incorrect conclusions. The general effect that "always the same callback" is (typically) faster than "several different callbacks" is certainly real, but this test significantly distorts the magnitude of the difference.

    At first, V8 sees that it's always the same function that's getting called, so the optimizing compiler decides to inline the function instead of calling it. That avoids the adaptation of arguments right off the bat. After inlining, the compiler can also see that the result of the exponentiation is never used, so it drops that entirely. The end result is that this test tests an empty loop! See for yourself:

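    function test_empty(no_callback) {
      // Same timing harness as test(), but the loop body is empty:
      // no call and no exponentiation.
      let start = performance.now();
      for (let i = 0; i < 1e6; i++) {}
      console.log(`empty loop took ${(performance.now() - start).toFixed(2)}ms`);
    }

    // Run it several times, like test() above.
    for (let i = 0; i < 10; i++) test_empty();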
    

    That gives me the same 0.23 ms as calling exampleA. So contrary to what we thought, we didn't measure the time it takes to call and execute exampleA; in reality we measured no calls at all, and no 10**10 exponentiations either. (If you'd like more direct proof, you can run the original test in d8 or node with --print-opt-code and see the disassembly of the optimized code that V8 generates internally.)

    All that lets us conclude a few things:

    (1) This is not a case of "OMG there's this horrible slowdown that you must be aware of and avoid in your code". The default performance you get when you don't worry about this is great. Sometimes when the stars align you might see even more impressive optimizations, but… to put it lightly: just because you only get presents on a few occasions per year, doesn't mean that all the other non-gift-bearing days are some horrible bug that must be avoided.

    (2) The smaller your test case, the bigger the observed difference between default speed and lucky fast case. If your callbacks are doing actual work that the compiler can't just eliminate, then the difference will be smaller than seen here. If your callbacks are doing more work than a single operation, then the fraction of overall time that's spent on the call itself will be smaller, so replacing the call with inlining will make less of a difference than it does here. If your functions are called with the parameters they need, that will avoid the needless penalization seen here. So while this microbenchmark manages to create the misleading impression that there's a shockingly large 30x difference, in most real applications it will be between maybe 4x in extreme cases and "not even measurable at all" for many other cases.

    (3) Function calls do have a cost. It's great that (for many languages, including JavaScript) we have optimizing compilers that can sometimes avoid them via inlining. If you have a case where you really, really care about every last bit of performance, and your compiler happens to not inline what you think it should be inlining (for whatever reason: because it can't, or because it has internal heuristics that decide not to), then redesigning your code a bit can give significant benefits -- e.g. you could inline by hand, or otherwise restructure your control flow to avoid millions of calls to tiny functions in your hottest loops. (Don't blindly overdo it though: having too few, too big functions isn't great for optimization either. Usually it's best to not worry about this. Organize your code into chunks that make sense, and let the engine take care of the rest. I'm only saying that sometimes, when you observe specific problems, you can help the engine do its job better.) If you do need to rely on performance-sensitive function calls, then an easy tuning step is to make sure that you're calling your functions with exactly as many arguments as they expect -- which is probably often what you would do anyway. Of course optional arguments have their uses as well; like in so many other cases, the extra flexibility comes with a (small) performance cost, which is often negligible, but can be taken into consideration when you feel that you have to.

    (4) Observing such performance differences can understandably be surprising and sometimes even frustrating. Unfortunately, the nature of optimizations is such that they can't always be applied: they rely on making simplifying assumptions and not covering every case, otherwise they wouldn't be fast any more. We work very hard to give you reliable, predictable performance, with as many fast cases and as few slow cases as possible, and no steep cliffs between them. But we cannot escape the reality that we can't possibly "just make everything fast". (Which of course isn't to say that there's nothing left to do: every additional year of engineering work brings additional performance gains.) If we wanted to avoid all cases where more-or-less similar code exhibits noticeably different performance, then the only way to accomplish that would be to not do any optimizations at all, and instead leave everything at baseline ("slow") implementations -- and I don't think that would make anyone happy.
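
    As a rough illustration of points (2) and (3), the sketch below compares a loop that calls a tiny callback with a hand-inlined version of the same loop. All names here (square, sumWithCallback, sumSquaresInlined, timeIt) are invented for the example. Unlike the microbenchmark above, the callback's result is accumulated and returned, so the compiler cannot simply discard the work. The timings it prints will vary by engine version and hardware, and in many real programs the two versions will be indistinguishable.

```javascript
const square = (x) => x * x;

// Callback version: one call per iteration unless the engine inlines it.
function sumWithCallback(n, fn) {
  let total = 0;
  for (let i = 0; i < n; i++) total += fn(i);
  return total;
}

// Hand-inlined version: the body of square is written directly in the loop.
function sumSquaresInlined(n) {
  let total = 0;
  for (let i = 0; i < n; i++) total += i * i;
  return total;
}

// Small timing helper; it returns the computed result so it stays observable.
function timeIt(label, fn) {
  const start = performance.now();
  const result = fn();
  console.log(`${label}: ${(performance.now() - start).toFixed(2)}ms`);
  return result;
}

for (let run = 0; run < 5; run++) {
  timeIt("callback", () => sumWithCallback(1e6, square));
  timeIt("hand-inlined", () => sumSquaresInlined(1e6));
}
```

    Hand inlining trades readability for speed, so it only makes sense for loops that profiling has shown to be hot.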

    EDIT to add: It seems there are major differences between different CPUs here, which probably explains why previous commenters have reported so wildly differing results. On hardware I can get my hands on, I'm seeing:

    • i7 6600U: 3.3 ms for inlined case, 28 ms for calling
    • i7 3635QM: 2.8 ms for inlined case, 10 ms for calling
    • i7 3635QM, up-to-date microcode: 2.8 ms for inlined case, 26 ms for calling
    • Ryzen 3900X: 2.5 ms for inlined case, 5 ms for calling

    This is all with Chrome 83/84 on Linux; it's very much possible that running on Windows or Mac would yield different results (because CPU/microcode/kernel/sandbox are closely interacting with each other). If you find these hardware differences shocking, read up on "Spectre".

  • 2021-02-03 17:52

    V8 developer here. It's not a bug, it's just an optimization that V8 doesn't do. It's interesting to see that Firefox seems to do it...

    FWIW, I don't see "ballooning to 400ms"; instead (similar to Jon Trent's comment) I see about 2.5ms at first, and then around 11ms.

    Here's the explanation:

    When you click only one button, then transition only ever sees one callback. (Strictly speaking it's a new instance of the arrow function every time, but since they all stem from the same function in the source, they're "deduped" for type feedback tracking purposes. Also, strictly speaking it's one callback each for stateTransition and transitionCondition, but that just duplicates the situation; either one alone would reproduce it.) When transition gets optimized, the optimizing compiler decides to inline the called function, because having seen only one function there in the past, it can make a high-confidence guess that it's also always going to be that one function in the future. Since the function does extremely little work, avoiding the overhead of calling it provides a huge performance boost.

    Once the second button is clicked, transition sees a second function. It must get deoptimized the first time this happens; since it's still hot it'll get reoptimized soon after, but this time the optimizer decides not to inline, because it's seen more than one function before, and inlining can be very expensive. The result is that from this point onwards, you'll see the time it takes to actually perform these calls. (The fact that both functions have identical source doesn't matter; checking that wouldn't be worth it because outside of toy examples that would almost never be the case.)

    There's a workaround, but it's something of a hack, and I don't recommend putting hacks into user code to account for engine behavior. V8 does support "polymorphic inlining", but (currently) only if it can deduce the call target from some object's type. So if you construct "config" objects that have the right functions installed as methods on their prototype, you can get V8 to inline them. Like so:

    class StateMachine {
      ...
      transition(config, maxCalls = Infinity) {
        let i = 0;
        while (
          config.condition &&
          config.condition(this.state) &&
          i++ < maxCalls
        ) config.transition(this.state);
    
        return this;
      }
      ...
    }
    
    class RedConfig {
      transition(state) { return state.test++ }
      condition(state) { return state.test < 1e6 }
    }
    class BlueConfig {
      transition(state) { return state.test++ }
      condition(state) { return state.test < 1e6 }
    }
    
    function red() {
      ...
      stateMachine.transition(new RedConfig());
      ...
    }
    function blue() {
      ...
      stateMachine.transition(new BlueConfig());
      ...
    }
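
    For anyone who wants to run this, here is a self-contained variant with the elided parts filled in by guesses. The constructor, the shape of the state object ({ test: 0 }) and the run helper are assumptions made for this sketch, not the code from the question. It only shows the structure of the workaround; whether V8 actually inlines the methods depends on the engine version.

```javascript
class StateMachine {
  constructor() {
    // Assumed initial state; the original code elides this part.
    this.state = { test: 0 };
  }

  transition(config, maxCalls = Infinity) {
    let i = 0;
    while (
      config.condition &&
      config.condition(this.state) &&
      i++ < maxCalls
    ) config.transition(this.state);

    return this;
  }
}

// The callbacks live on the prototypes of two distinct classes, so the
// call target can be derived from the type of the config object.
class RedConfig {
  transition(state) { return state.test++; }
  condition(state) { return state.test < 1e6; }
}
class BlueConfig {
  transition(state) { return state.test++; }
  condition(state) { return state.test < 1e6; }
}

// Hypothetical stand-in for the red()/blue() button handlers.
function run(label, ConfigClass) {
  const stateMachine = new StateMachine();
  const start = performance.now();
  stateMachine.transition(new ConfigClass());
  console.log(`${label} took ${(performance.now() - start).toFixed(2)}ms`);
  return stateMachine.state.test;
}

for (let i = 0; i < 5; i++) run("red", RedConfig);
for (let i = 0; i < 5; i++) run("blue", BlueConfig);
for (let i = 0; i < 5; i++) run("red", RedConfig);
```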
    

    It might be worth filing a bug (crbug.com/v8/new) to ask if the compiler team thinks that this is worth improving. Theoretically it should be possible to inline several functions that are called directly, and branch between the inlined paths based on the value of the function variable that's being called. However, I'm not sure there are many cases where the impact is as pronounced as in this simple benchmark. I also know that recently the trend has been towards inlining less rather than more, because on average that tends to be the better tradeoff (there are drawbacks to inlining, and whether it's worth it is necessarily always a guess, because the engine would have to predict the future in order to be sure).
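
    To make that theoretical idea concrete, this is roughly what "branching between inlined paths based on the function value" would look like if you wrote it by hand. It is purely an illustration of the concept, with a made-up function name (callDispatched); it is not what V8 generates, and not something I'd suggest writing in real code.

```javascript
const exampleA = (a, b) => 10 ** 10;
const exampleB = (a, b) => 10 ** 10;

// Hand-written illustration: compare the function value against the known
// targets, run the inlined body on a match, and fall back to a real call
// for any other function.
function callDispatched(callback) {
  if (callback === exampleA) return 10 ** 10; // inlined body of exampleA
  if (callback === exampleB) return 10 ** 10; // inlined body of exampleB
  return callback(); // unknown target: generic call
}

console.log(callDispatched(exampleA));
console.log(callDispatched(exampleB));
console.log(callDispatched(() => 7));
```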

    In conclusion, coding with many callbacks is a very flexible and often elegant technique, but it tends to come at an efficiency cost. (There are other varieties of inefficiency: e.g. a call with an inline arrow function like transition(state => state.something) allocates a new function object each time it's executed; that just so happens not to matter much in the example at hand.) Sometimes engines might be able to optimize away the overhead, and sometimes not.
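
    The allocation point is easy to check: every evaluation of an arrow function expression produces a distinct function object, whereas a function defined once is reused. The names below (makeInline, hoisted, getHoisted) are made up for this sketch. Defining the function once only works when it doesn't need to capture per-call variables.

```javascript
// Each call evaluates the arrow function expression again,
// which allocates a new function object.
function makeInline() {
  return (state) => state.something;
}

// Defined once at module level and reused on every call.
const hoisted = (state) => state.something;
function getHoisted() {
  return hoisted;
}

const first = makeInline();
const second = makeInline();

console.log(first === second);              // false: two distinct objects
console.log(getHoisted() === getHoisted()); // true: same object every time
```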
