What is the effect of ordering if…else if statements by probability?

花落未央 2020-12-07 13:39

Specifically, if I have a series of if...else if statements, and I somehow know beforehand the relative probability that each statement will evalua

10 Answers
  • 2020-12-07 13:53

    No, you should not, unless you are really sure that the target system is affected. By default, go by readability.

    I highly doubt your results. I've modified your example a bit, so reversing the execution order is easier. Ideone rather consistently shows that the reverse order is faster, though not by much. On certain runs even this occasionally flipped. I'd say the results are inconclusive. Coliru reports no real difference either. I can check the Exynos5422 CPU on my odroid xu4 later on.

    The thing is that modern CPUs have branch predictors. A great deal of logic is dedicated to prefetching both data and instructions, and modern x86 CPUs are rather smart when it comes to this. Some slimmer architectures like ARMs or GPUs might be more sensitive to branch order. But it is highly dependent on both the compiler and the target system.

    I would say that branch-ordering optimization is pretty fragile and ephemeral. Do it only as a final fine-tuning step.

    Code:

    #include <chrono>
    #include <iostream>
    #include <random>
    #include <algorithm>
    #include <iterator>
    #include <functional>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
        //Generate a vector of random integers from 1 to 100
        random_device rnd_device;
        mt19937 rnd_engine(rnd_device());
        uniform_int_distribution<int> rnd_dist(1, 100);
        auto gen = std::bind(rnd_dist, rnd_engine);
        vector<int> rand_vec(5000);
        generate(begin(rand_vec), end(rand_vec), gen);
        volatile int nLow, nMid, nHigh;
    
        //Count the number of values in each of three different ranges
        //Run the test a few times
        for (int n = 0; n != 10; ++n) {
    
            //Sort the conditional statements in reverse order of likelihood
            {
              nLow = nMid = nHigh = 0;
              auto start = chrono::high_resolution_clock::now();
              for (int& i : rand_vec) {
                  if (i >= 95) ++nHigh;               //Least likely branch
                  else if (i < 20) ++nLow;
                  else if (i >= 20 && i < 95) ++nMid; //Most likely branch
              }
              auto end = chrono::high_resolution_clock::now();
              cout << "Reverse-sorted: \t" << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << "ns" << endl;
            }
    
            {
              //Sort the conditional statements in order of likelihood
              nLow = nMid = nHigh = 0;
              auto start = chrono::high_resolution_clock::now();
              for (int& i : rand_vec) {
                  if (i >= 20 && i < 95) ++nMid;  //Most likely branch
                  else if (i < 20) ++nLow;
                  else if (i >= 95) ++nHigh;      //Least likely branch
              }
              auto end = chrono::high_resolution_clock::now();
              cout << "Sorted:\t\t\t" << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << "ns" << endl;
            }
            cout << endl;
        }
    }
    
  • 2020-12-07 13:53

    It also depends on your compiler and the platform you're compiling for.

    In theory, the most likely condition should make control jump as little as possible.

    Typically the most likely condition should be first:

    if (most_likely) {
         // most likely instructions
    } else …
    

    The most popular assembly languages are based on conditional branches that jump when a condition is true. That C code will likely be translated to pseudo-assembly like this:

    jump to ELSE if not(most_likely)
    // most likely instructions
    jump to end
    ELSE:
    …
    

    This is because a taken jump changes the program counter, which makes the CPU flush the execution pipeline and stall (on pipelined architectures, which are very common). After that, it is up to the compiler, which may or may not apply sophisticated optimizations so that the statistically most probable condition reaches its code with the fewest jumps.

  • 2020-12-07 13:54

    Put them in whatever logical order you like. Sure, the branch may be slower, but branching should not be the majority of work your computer is doing.

    If you are working on a performance-critical portion of code, then certainly use logical order, profile-guided optimization and other techniques, but for general code, I think it's really more of a stylistic choice.

  • 2020-12-07 14:01

    Based on some of the other answers here, it looks like the only real answer is: it depends. It depends on at least the following (though not necessarily in this order of importance):

    • Relative probability of each branch. This is the original question that was asked. Based on the existing answers, there seem to be some conditions under which ordering by probability helps, but that does not always appear to be the case. If the relative probabilities are not very different, then the order is unlikely to make any difference. However, if the first condition happens 99.999% of the time and the next one is a fraction of what is left, then I would assume that putting the most likely one first would be beneficial in terms of timing.
    • Cost of calculating the true/false condition for each branch. If the time cost of testing the conditions is really high for one branch versus another, then this is likely to have a significant impact on the timing and efficiency. For example, consider a condition that takes 1 time unit to calculate (e.g., checking the state of a Boolean variable) versus another condition that takes tens, hundreds, thousands, or even millions of time units to calculate (e.g., checking the contents of a file on disk or performing a complex SQL query against a large database). Assuming the code checks the conditions in order each time, the faster conditions should probably be first (unless they are dependent on other conditions failing first).
    • Compiler/Interpreter. Some compilers (or interpreters) may include optimizations of one kind or another that can affect performance (and some of these are only present if certain options are selected during compilation and/or execution). So unless you are benchmarking two compilations and executions of otherwise identical code on the same system using the exact same compiler, where the only difference is the order of the branches in question, you're going to have to give some leeway for compiler variations.
    • Operating System/Hardware. As mentioned by luk32 and Yakk, various CPUs have their own optimizations (as do operating systems). So benchmarks are again susceptible to variation here.
    • Frequency of code block execution. If the block that includes the branches is rarely accessed (e.g., only once during startup), then it probably matters very little what order you put the branches in. On the other hand, if your program is hammering away at this code block during a critical part of its run, then ordering may matter a lot (depending on benchmarks).

    The only way to know for certain is to benchmark your specific case, preferably on a system identical to (or very similar to) the intended system on which the code will finally run. If it is intended to run on a set of varying systems with differing hardware, operating system, etc., then it is a good idea to benchmark across multiple variations to see which is best. It may even be a good idea to have the code be compiled with one ordering on one type of system and another ordering on another type of system.

    My personal rule of thumb (for most cases, in the absence of a benchmark) is to order based on:

    1. Conditions that rely on the result of prior conditions,
    2. Cost of computing the condition, then
    3. Relative probability of each branch.
  • 2020-12-07 14:02

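    I decided to rerun the test on my own machine using luk32's code. I had to change it (the vector size, shown below the compiler command) because my Windows setup or compiler reports the high-resolution clock with only 1 ms resolution. I compiled using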

    mingw32-g++.exe -O3 -Wall -std=c++11 -fexceptions -g

    vector<int> rand_vec(10000000);
    

    GCC made the same transformation on both original versions.

    Note that only the first two conditions are tested, as the third must always be true; GCC is something of a Sherlock here.

    Reverse

    .L233:
            mov     DWORD PTR [rsp+104], 0
            mov     DWORD PTR [rsp+100], 0
            mov     DWORD PTR [rsp+96], 0
            call    std::chrono::_V2::system_clock::now()
            mov     rbp, rax
            mov     rax, QWORD PTR [rsp+8]
            jmp     .L219
    .L293:
            mov     edx, DWORD PTR [rsp+104]
            add     edx, 1
            mov     DWORD PTR [rsp+104], edx
    .L217:
            add     rax, 4
            cmp     r14, rax
            je      .L292
    .L219:
            mov     edx, DWORD PTR [rax]
            cmp     edx, 94
            jg      .L293 // >= 95
            cmp     edx, 19
            jg      .L218 // >= 20
            mov     edx, DWORD PTR [rsp+96]
            add     rax, 4
            add     edx, 1 // < 20 Sherlock
            mov     DWORD PTR [rsp+96], edx
            cmp     r14, rax
            jne     .L219
    .L292:
            call    std::chrono::_V2::system_clock::now()
    
    .L218: // further down
            mov     edx, DWORD PTR [rsp+100]
            add     edx, 1
            mov     DWORD PTR [rsp+100], edx
            jmp     .L217
    
    And sorted
    
            mov     DWORD PTR [rsp+104], 0
            mov     DWORD PTR [rsp+100], 0
            mov     DWORD PTR [rsp+96], 0
            call    std::chrono::_V2::system_clock::now()
            mov     rbp, rax
            mov     rax, QWORD PTR [rsp+8]
            jmp     .L226
    .L296:
            mov     edx, DWORD PTR [rsp+100]
            add     edx, 1
            mov     DWORD PTR [rsp+100], edx
    .L224:
            add     rax, 4
            cmp     r14, rax
            je      .L295
    .L226:
            mov     edx, DWORD PTR [rax]
            lea     ecx, [rdx-20]
            cmp     ecx, 74
            jbe     .L296
            cmp     edx, 19
            jle     .L297
            mov     edx, DWORD PTR [rsp+104]
            add     rax, 4
            add     edx, 1
            mov     DWORD PTR [rsp+104], edx
            cmp     r14, rax
            jne     .L226
    .L295:
            call    std::chrono::_V2::system_clock::now()
    
    .L297: // further down
            mov     edx, DWORD PTR [rsp+96]
            add     edx, 1
            mov     DWORD PTR [rsp+96], edx
            jmp     .L224
    

    So this doesn't tell us much, except that the last case doesn't need a branch prediction.

    Now I tried all 6 combinations of the ifs; the top 2 are the original reverse and sorted orders. high is >= 95, low is < 20, and mid is 20-94, with 10000000 iterations each.

    high, low, mid: 43000000ns
    mid, low, high: 46000000ns
    high, mid, low: 45000000ns
    low, mid, high: 44000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 44000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 45000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 42000000ns
    mid, low, high: 46000000ns
    high, mid, low: 46000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 43000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 44000000ns
    low, mid, high: 44000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 48000000ns
    high, mid, low: 44000000ns
    low, mid, high: 44000000ns
    mid, high, low: 45000000ns
    low, high, mid: 45000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 47000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 46000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 43000000ns
    mid, low, high: 46000000ns
    high, mid, low: 45000000ns
    low, mid, high: 45000000ns
    mid, high, low: 45000000ns
    low, high, mid: 44000000ns
    
    high, low, mid: 42000000ns
    mid, low, high: 46000000ns
    high, mid, low: 44000000ns
    low, mid, high: 45000000ns
    mid, high, low: 45000000ns
    low, high, mid: 44000000ns
    
    1900020, 7498968, 601012
    
    Process returned 0 (0x0)   execution time : 2.899 s
    Press any key to continue.
    

    So why is the order high, low, mid faster (marginally)?

    Because the most unpredictable is last and therefore is never run through a branch predictor.

              if (i >= 95) ++nHigh;               // most predictable with 94% taken
              else if (i < 20) ++nLow; // (94-19)/94% taken ~80% taken
              else if (i >= 20 && i < 95) ++nMid; // never taken as this is the remainder of the outfalls.
    

    So the branches will be predicted taken, taken and remainder with

    6%+(0.94*)20% mispredicts.

    "Sorted"

              if (i >= 20 && i < 95) ++nMid;  // 75% not taken
              else if (i < 20) ++nLow;        // 19/25 76% not taken
              else if (i >= 95) ++nHigh;      //Least likely branch
    

    The branches will be predicted with not taken, not taken and Sherlock.

    25%+(0.75*)24% mispredicts

    Giving an 18-23% difference (measured difference of ~9%), but we need to calculate cycles instead of misprediction percentages.

    Let's assume a 17-cycle mispredict penalty on my Nehalem CPU, and that each check takes 1 cycle to issue (4-5 instructions) and the loop takes one cycle too. The data dependencies are the counters and the loop variables, but once the mispredicts are out of the way they shouldn't influence the timing.

    So for "reverse", we get the timings (this should be the formula used in Computer Architecture: A Quantitative Approach IIRC).

    mispredict*penalty+count+loop
    0.06*17+1+1+    (=3.02)
    (probability)*(first check+mispredict*penalty+count+loop)
    (0.19)*(1+0.20*17+1+1)+  (= 0.19*6.4=1.22)
    (probability)*(first check+second check+count+loop)
    (0.75)*(1+1+1+1) (=3)
    = 7.24 cycles per iteration
    

    and the same for "sorted"

    0.25*17+1+1+ (=6.25)
    (1-0.75)*(1+0.24*17+1+1)+ (=.25*7.08=1.77)
    (1-0.75-0.19)*(1+1+1+1)  (= 0.06*4=0.24)
    = 8.26
    

    (8.26-7.24)/8.26 = 12.3% vs. ~9% measured (close to the measured!?!).

    So what seemed obvious to the OP is not so obvious.

    These results are specific to this test; other tests with more complicated code or more data dependencies will certainly behave differently, so measure your own case.

    Changing the order of the tests changed the results, but that could be because of different alignments of the loop start, which should ideally be 16-byte aligned on all newer Intel CPUs but isn't in this case.

  • 2020-12-07 14:07

    I made up the following test to time the execution of two different if...else if blocks, one sorted in order of probability, the other sorted in reverse order:

    #include <chrono>
    #include <iostream>
    #include <random>
    #include <algorithm>
    #include <iterator>
    #include <functional>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
        long long sortedTime = 0;
        long long reverseTime = 0;
    
        for (int n = 0; n != 500; ++n)
        {
            //Generate a vector of 5000 random integers from 1 to 100
            random_device rnd_device;
            mt19937 rnd_engine(rnd_device());
            uniform_int_distribution<int> rnd_dist(1, 100);
            auto gen = std::bind(rnd_dist, rnd_engine);
            vector<int> rand_vec(5000);
            generate(begin(rand_vec), end(rand_vec), gen);
    
            volatile int nLow, nMid, nHigh;
            chrono::time_point<chrono::high_resolution_clock> start, end;
    
            //Sort the conditional statements in order of increasing likelihood
            nLow = nMid = nHigh = 0;
            start = chrono::high_resolution_clock::now();
            for (int& i : rand_vec) {
                if (i >= 95) ++nHigh;               //Least likely branch
                else if (i < 20) ++nLow;
                else if (i >= 20 && i < 95) ++nMid; //Most likely branch
            }
            end = chrono::high_resolution_clock::now();
            reverseTime += chrono::duration_cast<chrono::nanoseconds>(end-start).count();
    
            //Sort the conditional statements in order of decreasing likelihood
            nLow = nMid = nHigh = 0;
            start = chrono::high_resolution_clock::now();
            for (int& i : rand_vec) {
                if (i >= 20 && i < 95) ++nMid;  //Most likely branch
                else if (i < 20) ++nLow;
                else if (i >= 95) ++nHigh;      //Least likely branch
            }
            end = chrono::high_resolution_clock::now();
            sortedTime += chrono::duration_cast<chrono::nanoseconds>(end-start).count();
    
        }
    
        cout << "Percentage difference: " << 100 * (double(reverseTime) - double(sortedTime)) / double(sortedTime) << endl << endl;
    }
    

    Using MSVC2017 with /O2, the results show that the sorted version is consistently about 28% faster than the unsorted version. Per luk32's comment, I also switched the order of the two tests, which makes a noticeable difference (22% vs 28%). The code was run under Windows 7 on an Intel Xeon E5-2697 v2. This is, of course, very problem-specific and should not be interpreted as a conclusive answer.
