x86 max/min asm instructions?

礼貌的吻别 2021-02-06 06:55

Are there any asm instructions that can speed up computing the min/max of a vector of doubles/integers on the Core i7 architecture?

Update:

I didn't

6 Answers
  •  猫巷女王i
    2021-02-06 07:16

    SSE4.1 has PMAXSD and PMAXUD for 32-bit signed and unsigned integers respectively, which might be useful.

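    SSE2 has MAXPD and MAXSD. MAXPD takes the element-wise maximum of two pairs of doubles, and MAXSD compares only the low double of each register. To get the max of a vector of n doubles you therefore follow n/2-1 MAXPDs with one MAXSD (after moving the high half of the accumulator into the low position, for example with UNPCKHPD), with the usual interleaving of loads and operations.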

    There are MIN equivalents of the above.
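
    The MAXPD/MAXSD reduction described above can be sketched with the SSE2 intrinsics from <emmintrin.h>. This is only an illustration, not a tuned implementation: the function name max_pd_sketch is made up for this example, it assumes n >= 1, uses unaligned loads for simplicity, and ignores NaN handling.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_pd, _mm_max_pd, _mm_max_sd, ...
#include <cstddef>

// Hypothetical helper (the name is illustrative): maximum of n doubles, n >= 1.
double max_pd_sketch(const double* data, std::size_t n)
{
    if (n == 1) return data[0];

    // acc holds two running maxima, one per 64-bit lane.
    __m128d acc = _mm_loadu_pd(data);
    std::size_t i = 2;
    for (; i + 2 <= n; i += 2) {
        acc = _mm_max_pd(acc, _mm_loadu_pd(data + i));  // MAXPD: element-wise max
    }

    // Copy the high lane into the low position (UNPCKHPD), then combine
    // the two lanes with a single scalar max (MAXSD).
    const __m128d high = _mm_unpackhi_pd(acc, acc);
    double result = _mm_cvtsd_f64(_mm_max_sd(acc, high));

    // An odd n leaves one trailing element that was never loaded as a pair.
    if (i < n && data[i] > result) result = data[i];
    return result;
}
```

    Swapping _mm_max_pd and _mm_max_sd for _mm_min_pd and _mm_min_sd (and flipping the final comparison) gives the min version.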

    For the double case, you're probably not going to do better in assembler than a half-decent C++ compiler in SSE mode:

    peregrino:$ g++ -O3 src/min_max.cpp -o bin/min_max
    peregrino:$ g++ -O3 -msse4 -mfpmath=sse src/min_max.cpp -o bin/min_max_sse
    peregrino:$ time bin/min_max
    0,40
    
    real    0m0.874s
    user    0m0.796s
    sys     0m0.004s
    peregrino:$ time bin/min_max_sse 
    0,40
    
    real    0m0.457s
    user    0m0.404s
    sys     0m0.000s
    

    where min_max computes min and max of an array of 500 doubles 100,000 times using a naive loop:

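    #include <cstddef>
    
    // Returns false (leaving min and max untouched) when the array is empty.
    bool min_max ( double array[], size_t len, double& min, double& max )
    {
        if ( len == 0 ) return false;
    
        double min_value = array [ 0 ];
        double max_value = array [ 0 ];
    
        // Naive scan: compare every element against both running extremes.
        for ( size_t index = 1; index < len; ++index ) {
            if ( array [ index ] < min_value ) min_value = array [ index ];
            if ( array [ index ] > max_value ) max_value = array [ index ];
        }
    
        min = min_value;
        max = max_value;
        return true;
    }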
    

    In response to part two, the traditional optimisation to remove branching from a max operation is to compare the values, get the flag as a single bit (0 or 1), negate it to form a mask (0 or 0xffff_ffff), AND the mask with the XOR of the two possible results, and XOR that with the fallback value, so you get the equivalent of ((a > best ? (current_index ^ best_index) : 0) ^ best_index).

    I doubt there's a simple SSE way of doing that, because SSE tends to operate on packed values rather than tagged values. There are some horizontal index operations, so you could try finding the max, subtracting it from every element of the original vector, and then gathering the sign bits: every element below the max becomes negative, so the element whose sign bit is clear gives the index of the max. That would probably not be an improvement unless you were using shorts or bytes.
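
    As a rough scalar illustration of the mask trick for tracking the index of the max, something like the following could be used. The names select_u32 and argmax_sketch are invented for this example, the code assumes len >= 1 and that indices fit in 32 bits, and whether it actually beats a plain conditional depends on the compiler and the data.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical helper: returns candidate when take_candidate is true,
// otherwise current, using a mask instead of a branch.
inline std::uint32_t select_u32(bool take_candidate,
                                std::uint32_t candidate,
                                std::uint32_t current)
{
    // bool converts to 0 or 1; negating in unsigned arithmetic
    // yields 0 or 0xffffffff.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(take_candidate);
    // mask == 0          -> 0 ^ current                     == current
    // mask == 0xffffffff -> (candidate ^ current) ^ current == candidate
    return (mask & (candidate ^ current)) ^ current;
}

// Hypothetical helper: index of the first maximum element, len >= 1.
std::uint32_t argmax_sketch(const double* array, std::size_t len)
{
    std::uint32_t best_index = 0;
    double best = array[0];
    for (std::size_t index = 1; index < len; ++index) {
        const bool better = array[index] > best;
        best_index = select_u32(better, static_cast<std::uint32_t>(index), best_index);
        best = better ? array[index] : best;  // value update, left to the compiler
    }
    return best_index;
}
```

    Because the comparison is strict, ties keep the earliest index.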
