C++ fast division/mod by 10^x

我寻月下人不归 2020-12-03 05:13

In my program I use a lot of integer division by 10^x and integer modulo by powers of 10.

For example:

unsigned __int64 a = 12345;
a = a / 100;
...         


        
10 Answers
  • 2020-12-03 05:30

    This is great for environments that lack any div operation, and it's only ~2x slower than native division on my i7 (optimizations off, naturally).

    Here's a slightly faster version of the algorithm, though there are still some nasty rounding errors with negative numbers.

    static signed Div10(signed n)
    {
        n = (n >> 1) + (n >> 2);
        n += n < 0 ? 9 : 2;
        n = n + (n >> 4);
        n = n + (n >> 8);
        n = n + (n >> 16);
        n = n >> 3;
        return n;
    }
    

    Since this method is for 32-bit integer precision, you can optimize away most of these shifts if you're working in an 8-bit or 16-bit environment.
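The remark about narrower integers can be sketched as follows. This is only an illustration under stated assumptions: it uses an unsigned shift-and-add form with a remainder correction step (not the signed routine above), and `divu10_16` is a hypothetical name. For inputs below 65536 the `q >> 16` step always contributes zero, so it is dropped.

```cpp
#include <cstdint>

// Hypothetical 16-bit unsigned variant of the shift-and-add division by 10.
// Assumption: intermediate values are held in 32 bits for simplicity.
inline std::uint16_t divu10_16(std::uint16_t n) {
    std::uint32_t q = (n >> 1) + (n >> 2);  // q is roughly n * 0.75
    q += q >> 4;                            // refine toward n * 0.8
    q += q >> 8;                            // no (q >> 16) step: it is zero for 16-bit inputs
    q >>= 3;                                // q is now n / 10, possibly one too small
    std::uint32_t r = n - q * 10;           // remainder relative to the estimate
    return static_cast<std::uint16_t>(q + ((r + 6) >> 4));  // add 1 if r >= 10
}
```

Because the input range is small, the result can be compared against `n / 10` for every possible input before relying on it.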

  • 2020-12-03 05:35

    Short Answer: NO

    Long Answer: NO.

    Explanation:
    The compiler is already optimizing statements like this for you.
    If there is a technique for implementing this quicker than an integer division then the compiler already knows about it and will apply it (assuming you turn on optimizations).

    If you also provide the appropriate architecture flags, the compiler may even know about fast architecture-specific instructions that provide a nice trick for the operation; otherwise it will apply the best trick for the generic architecture it was compiled for.

    In short, the compiler will beat the human 99.9999999% of the time at any optimization trick (try it, and remember to add the optimization and architecture flags). So the best you can normally do is equal the compiler.

    If by some miracle you discover a method that has not already been found by the assembly boffins who work closely with the backend compiler team, then please let them know, and the next version of the popular compilers will be updated with the 'unknown (google)' division by 10 optimization trick.
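To illustrate letting the compiler do the work: when the exponent is known at compile time, plain `/` and `%` with a constant divisor are usually all that is needed. A minimal sketch, assuming C++11 or later; `Pow10`, `div_pow10`, and `mod_pow10` are hypothetical names introduced here, not part of the question.

```cpp
#include <cstdint>

// Hypothetical helper: 10^X as a compile-time constant.
template <unsigned X>
struct Pow10 {
    static_assert(X <= 19, "10^X does not fit in 64 bits");
    static constexpr std::uint64_t value = 10 * Pow10<X - 1>::value;
};

// Base case: 10^0 == 1.
template <>
struct Pow10<0> {
    static constexpr std::uint64_t value = 1;
};

// The divisor is a constant expression, so the compiler is free to
// pick its own strategy instead of emitting a generic divide.
template <unsigned X>
constexpr std::uint64_t div_pow10(std::uint64_t a) { return a / Pow10<X>::value; }

template <unsigned X>
constexpr std::uint64_t mod_pow10(std::uint64_t a) { return a % Pow10<X>::value; }
```

For example, `div_pow10<2>(12345)` yields 123 and `mod_pow10<2>(12345)` yields 45. Whether this is faster than a runtime divisor depends on the compiler and flags, so it is worth checking the generated assembly.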

  • 2020-12-03 05:38

    Short Answer: THAT DEPENDS.

    Long Answer:

    Yes, it is very possible IF you can use things that the compiler cannot automatically deduce. In my experience this is quite rare, though; most compilers are pretty good at vectorizing nowadays. Much depends on how you model your data and how willing you are to create incredibly complex code. For most users, I wouldn't recommend going through the trouble in the first place.

    To give you an example, here's the implementation of x / 10 where x is a signed integer (this is actually what the compiler will generate):

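    long long product = (long long)value * 0x66666667;  // full 64-bit signed product
    int edx = (int)(product >> 32) >> 2;                // high 32 bits; NOTE: use arithmetic shift here!
    int result = edx + (int)((unsigned)edx >> 31);      // logical shift: adds 1 when edx is negative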
    

    If you disassemble your compiled C++ code and you used a constant for the '10', the assembly will reflect the above. If you didn't use a constant, the compiler will generate an idiv, which is much slower.
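The multiply-and-shift sequence can be written as a self-contained function and compared against ordinary division. This is a sketch, not compiler output: `div10_magic` is a hypothetical name, and the code assumes that `>>` on a negative signed value is an arithmetic shift (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cstdint>

// Signed division by 10 via multiplication by 0x66666667, which is
// ceil(2^34 / 10). The quotient is the product shifted right by 34 bits,
// plus 1 for negative inputs so that the result truncates toward zero.
inline std::int32_t div10_magic(std::int32_t value) {
    std::int64_t product = static_cast<std::int64_t>(value) * 0x66666667LL;
    std::int32_t high = static_cast<std::int32_t>(product >> 32);  // upper 32 bits
    std::int32_t q = high >> 2;                                    // arithmetic shift
    // Logical shift extracts the sign bit: 1 for negative q, 0 otherwise.
    q += static_cast<std::int32_t>(static_cast<std::uint32_t>(q) >> 31);
    return q;
}
```

The unsigned cast in the last step matters: an arithmetic shift by 31 would produce -1 instead of 1 for negative values and push the result the wrong way.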

    Knowing that your memory is aligned, or that your code can be vectorized, can be very beneficial. Do note that this requires you to store your data in such a way that this is possible.

    For example, if you want to calculate the sum-of-div/10's of all integers, you can do something like this:

        __m256i ctr = _mm256_set_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        ctr = _mm256_add_epi32(_mm256_set1_epi32(INT32_MIN), ctr);
    
        __m256i sumdiv = _mm256_set1_epi32(0);
        const __m256i magic = _mm256_set1_epi32(0x66666667);
        const int shift = 2;
    
        // Show that this is correct:
        for (long long int i = INT32_MIN; i <= INT32_MAX; i += 8)
        {
            // Compute the overflow values
            __m256i ovf1 = _mm256_srli_epi64(_mm256_mul_epi32(ctr, magic), 32);
            __m256i ovf2 = _mm256_mul_epi32(_mm256_srli_epi64(ctr, 32), magic);
    
            // blend the overflows together again
            __m256i rem = _mm256_srai_epi32(_mm256_blend_epi32(ovf1, ovf2, 0xAA), shift);
    
            // calculate the div value
            __m256i div = _mm256_add_epi32(rem, _mm256_srli_epi32(rem, 31));
    
            // do something with the result; increment the counter
            sumdiv = _mm256_add_epi32(sumdiv, div);
            ctr = _mm256_add_epi32(ctr, _mm256_set1_epi32(8));
        }
    
        int sum = 0;
        // m256i_i32 is an MSVC-specific accessor; other compilers need a store to an int array instead
        for (int i = 0; i < 8; ++i) { sum += sumdiv.m256i_i32[i]; }
        std::cout << sum << std::endl;
    

    If you benchmark these implementations on an Intel Haswell processor, you'll get these results:

    • idiv: 1.4 GB/s
    • compiler optimized: 4 GB/s
    • AVX2 instructions: 16 GB/s

    For other powers of 10 and unsigned division, I recommend reading the paper.
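As a hedged illustration of the unsigned case and a second power of 10, the same multiply-and-shift idea works with different constants. The function names below are hypothetical, and the constants are ceil(2^35 / 10) and ceil(2^37 / 100); they are stated here as assumptions to be checked against `n / 10` and `n / 100` rather than taken from the paper.

```cpp
#include <cstdint>

// Unsigned 32-bit division by 10: multiply by ceil(2^35 / 10), shift by 35.
inline std::uint32_t divu10_magic(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 0xCCCCCCCDu) >> 35);
}

// Unsigned 32-bit division by 100: multiply by ceil(2^37 / 100), shift by 37.
inline std::uint32_t divu100_magic(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(n) * 0x51EB851Fu) >> 37);
}

// The remainder follows from the quotient: n mod 100 = n - (n / 100) * 100.
inline std::uint32_t modu100_magic(std::uint32_t n) {
    return n - divu100_magic(n) * 100u;
}
```

No sign correction is needed for unsigned inputs, which is why these are shorter than the signed version. In practice, writing `n / 100u` and `n % 100u` with optimizations on is normally expected to give equivalent code.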

  • 2020-12-03 05:42

    From http://www.hackersdelight.org/divcMore.pdf

    unsigned divu10(unsigned n) {
        unsigned q, r;
        q = (n >> 1) + (n >> 2);
        q = q + (q >> 4);
        q = q + (q >> 8);
        q = q + (q >> 16);
        q = q >> 3;
        r = n - q*10;
        return q + ((r + 6) >> 4);
    }
    