x86 max/min asm instructions?

Front-end · unresolved · 6 answers · 823 views
礼貌的吻别 2021-02-06 06:55

Are there any asm instructions that can speed up computation of min/max of vector of doubles/integers on Core i7 architecture?

Update:

I didn't

6 Answers
  • 2021-02-06 07:00

    Update: I just realized you said "array", not "vector" in part 2. I'll leave this here anyway in case it's useful.


    re: part two: find the index of the max/min element in an SSE vector:

    • Do a horizontal maximum. For a 128b vector of 2 double elements, that's just one shufpd + maxpd to leave the result broadcast to both elements.

  For other cases, it will of course take more steps. See Fastest way to do horizontal float vector sum on x86 for ideas, replacing addps with maxps or minps. (But note that 16-bit unsigned integer is special, because you can use SSE4 phminposuw. For max, subtract each element from 65535 first, then undo it afterwards.)

    • Do a packed-compare between the original vector and the vector where every element is the max.

      (pcmpeqq on the integer bit-patterns or the usual cmpeqpd would both work for the double case.)

    • int _mm_movemask_pd (__m128d a) (movmskpd) to get the compare result as an integer bitmap.
    • bit-scan (bsf) it for the (first) match: index = _bit_scan_forward(cmpmask). cmpmask = 0 is impossible if you used integer compares (because at least one element will match even if they are NaN).

    This should compile to only 6 instructions (including a movapd). Yup, just checked on the Godbolt compiler explorer and it does, with SSE.

    #include <immintrin.h>
    #include <x86intrin.h>
    
    int maxpos(__m128d v) {
      __m128d swapped = _mm_shuffle_pd(v,v, 1);
      __m128d maxbcast = _mm_max_pd(swapped, v);
      __m128d cmp = _mm_cmpeq_pd(maxbcast, v);
      int cmpmask = _mm_movemask_pd(cmp);
      return _bit_scan_forward(cmpmask);
    }
    

    Note that _mm_max_pd is not commutative with NaN inputs. If NaN is possible, and you don't care about performance on Intel Nehalem, you might consider using _mm_cmpeq_epi64 to compare bit-patterns. Bypass-delay from float to vec-int is a problem on Nehalem, though.

    NaN != NaN in IEEE floating point, so the _mm_cmpeq_pd result mask could be all-zero in the all-NaN case.

    Another thing you can do in the 2-element case to always get a 0 or 1 is to replace the bit-scan with cmpmask >> 1. (bsf is weird with input = all-zero).

  • 2021-02-06 07:04

    If you are using Intel's IPP library, you can use the vector statistical functions to compute vector min/max (among other things).

  • 2021-02-06 07:08

    In response to your second question, it may be worthwhile to think about the way you collect and store this data.

    You could store the data in a B-tree that keeps it sorted at all times, requiring only a logarithmic number of comparisons per insertion.

    Then you know at all times where the maximum is.

    http://en.wikipedia.org/wiki/B_tree

  • 2021-02-06 07:16

    SSE4.1 has PMAXSD and PMAXUD for 32-bit signed/unsigned integers, which might be useful.

    SSE2 has MAXPD (packed, element-wise on pairs of doubles) and MAXSD (scalar), so you can fold an array of n doubles down with n/2 - 1 MAXPDs and finish with one MAXSD after swapping the accumulator's two halves, with the usual interleaving of loads and operations.

    There are MIN equivalents of the above.

    For the double case, you're probably not going to do better in assembler than a half-decent C++ compiler in SSE mode:

    peregrino:$ g++ -O3 src/min_max.cpp -o bin/min_max
    peregrino:$ g++ -O3 -msse4 -mfpmath=sse src/min_max.cpp -o bin/min_max_sse
    peregrino:$ time bin/min_max
    0,40
    
    real    0m0.874s
    user    0m0.796s
    sys 0m0.004s
    peregrino:$ time bin/min_max_sse 
    0,40
    
    real    0m0.457s
    user    0m0.404s
    sys 0m0.000s
    

    where min_max computes min and max of an array of 500 doubles 100,000 times using a naive loop:

    void min_max ( double array[], size_t len, double& min, double& max )
    {
        double min_value = array [ 0 ];
        double max_value = array [ 0 ];
    
        for ( size_t index = 1; index < len; ++index ) {
            if ( array [ index ] < min_value ) min_value = array [ index ];
            if ( array [ index ] > max_value ) max_value = array [ index ];
        }
    
        min = min_value;
        max = max_value;
    }
    

    In response to part two, the traditional optimisation to remove branching from a max operation is: compare the values, materialize the flag as a single bit (giving 0 or 1), negate it (giving 0 or 0xffff_ffff), and 'and' it with the xor of the two possible results, so you compute the equivalent of ( a > best ? ( current_index ^ best_index ) : 0 ) ^ best_index. I doubt there's a simple SSE way of doing that, simply because SSE tends to operate on packed values rather than tagged values. There are some horizontal operations, so you could try finding the max, subtracting it from every element of the original vector, then gathering the sign bits; the element whose difference is zero (sign bit clear) corresponds to the index of the max. But that would probably not be an improvement unless you were using shorts or bytes.

  • 2021-02-06 07:18

    MAXPS and MINPS from SSE both operate on packed single-precision floating-point numbers. PMAXSW and PMINSW operate on packed signed 16-bit words; PMAXUB and PMINUB operate on packed unsigned 8-bit bytes. Please note that these compare two inputs (register or memory) element-wise and write the result to the destination SSE register.

    SSE2 adds MAXPD and MINPD, the double-precision equivalents of MAXPS and MINPS.

    What compiler and optimization flags are you using? gcc 4.0 and later should automatically vectorize operations if your target supports them; earlier versions may need a specific flag.

  • 2021-02-06 07:25

    In response to your second question: on most platforms, there are libraries that already contain optimized implementations of this very operation (and most other simple vector operations). Use them.

    • On OS X, there are vDSP_maxviD( ) and cblas_idamax( ) in Accelerate.framework
    • The Intel compilers include the IPP and MKL libraries, which have high performance implementations, including cblas_idamax( )
    • Most Linux systems will have cblas_idamax( ) in the BLAS library, which may or may not be well-tuned depending on its provenance; users who care about performance will generally have a good implementation (or can be persuaded to install one)
    • If all else fails, you can use ATLAS (Automatically Tuned Linear Algebra Software) to get a decent performance implementation on the target platform