Are there any asm instructions that can speed up computation of min/max of vector of doubles/integers on Core i7 architecture?
Update:
I didn't
Update: I just realized you said "array", not "vector" in part 2. I'll leave this here anyway in case it's useful.
re: part two: find the index of the max/min element in an SSE vector:
Do a horizontal maximum. For a 128b vector of 2 double elements, that's just one shufpd + maxpd to leave the result broadcast to both elements.
For other cases, it will of course take more steps. See Fastest way to do horizontal float vector sum on x86 for ideas, replacing addps with maxps or minps. (But note that unsigned 16-bit integer is special, because you can use SSE4.1 phminposuw. For max, subtract each element from 65535 first, or from 255 if the values are known to fit in 8 bits, then undo the subtraction on the result.)
Do a packed-compare between the original vector and the vector where every element is the max. (pcmpeqq on the integer bit patterns or the usual cmpeqpd would both work for the double case.)
Extract the compare result as an integer bitmask (movmskpd) and bit-scan (bsf) it for the (first) match: index = _bit_scan_forward(cmpmask). cmpmask = 0 is impossible if you used integer compares (because at least one element will match even if they are NaN).
This should compile to only 6 instructions (including a movapd). Yup, just checked on the Godbolt compiler explorer and it does, with SSE.
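#include <immintrin.h>
#include <x86intrin.h>   // _bit_scan_forward

int maxpos(__m128d v) {
    __m128d swapped  = _mm_shuffle_pd(v, v, 1);     // swap the two elements
    __m128d maxbcast = _mm_max_pd(swapped, v);      // max broadcast to both elements
    __m128d cmp      = _mm_cmpeq_pd(maxbcast, v);   // all-ones where v equals the max
    int cmpmask = _mm_movemask_pd(cmp);             // one mask bit per element
    return _bit_scan_forward(cmpmask);              // index of the first match
}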
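The same three steps should carry over to wider element counts. Here is a rough sketch for a vector of 4 floats, using only SSE/SSE2 shuffles. The names hmax_bcast_ps and maxpos_ps are made up for illustration, and it assumes no NaN inputs.

```c
#include <immintrin.h>
#include <x86intrin.h>   // _bit_scan_forward

// Hypothetical helper: broadcast the maximum of four floats to every element.
__m128 hmax_bcast_ps(__m128 v) {
    // Swap neighbours within each pair: [v1, v0, v3, v2]
    __m128 swapped = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 pairmax = _mm_max_ps(swapped, v);        // [m01, m01, m23, m23]
    // Swap the two 64-bit halves: [m23, m23, m01, m01]
    __m128 halves = _mm_shuffle_ps(pairmax, pairmax, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_max_ps(halves, pairmax);             // max of all four, in every element
}

// Index (0..3) of the first element equal to the maximum. Assumes no NaNs.
int maxpos_ps(__m128 v) {
    __m128 maxbcast = hmax_bcast_ps(v);
    __m128 cmp = _mm_cmpeq_ps(maxbcast, v);         // all-ones where v equals the max
    int cmpmask = _mm_movemask_ps(cmp);             // one mask bit per element
    return _bit_scan_forward(cmpmask);
}
```

For example, maxpos_ps(_mm_setr_ps(1, 7, 3, 5)) picks element 1. With ties, the bit-scan returns the lowest matching index.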
Note that _mm_max_pd is not commutative with NaN inputs. If NaN is possible, and you don't care about performance on Intel Nehalem, you might consider using _mm_cmpeq_epi64 to compare bit-patterns. Bypass-delay from float to vec-int is a problem on Nehalem, though.
NaN != NaN in IEEE floating point, so the _mm_cmpeq_pd result mask could be all-zero in the all-NaN case.
Another thing you can do in the 2-element case to always get a 0 or 1 is to replace the bit-scan with cmpmask >> 1. (bsf is weird with input = all-zero: it leaves its destination undefined.)