Question
Suppose we have this:
....
pxor xmm1, xmm1
movdqu xmm0, [reax]
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
test ax , ax
jz .zero
...
Is there any way to avoid 'pmovmskb' and test the bitmask directly in xmm0 (to check whether it is zero)? Is there an SSE instruction for this?
In fact, I'm looking for something that behaves like 'ptest xmm0, xmm0' but is available in SSE2, not SSE4.
Answer 1:
It's generally not worth using SSE4.1 ptest xmm0,xmm0 on a pcmpeqb result, especially not if you're branching.
pmovmskb is 1 uop, and cmp or test can macro-fuse with jnz into another single uop on both Intel and AMD CPUs. That's a total of 2 uops to branch on a pcmpeqb result with pmovmskb + test/jcc.
But ptest is 2 uops, and its 2nd uop can't macro-fuse with a following branch. That's a total of 3 uops to branch on a vector with ptest + jcc.
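For readers working in C, the pcmpeqb + pmovmskb + test/jcc sequence discussed above might be written with SSE2 intrinsics roughly as follows. This is only a sketch: the function name has_zero_byte_16 is made up for illustration, and the instruction named in each comment is what a compiler would typically emit for that intrinsic, not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns 1 if any of the 16 bytes at p is zero,
   0 otherwise. p does not need to be aligned. */
int has_zero_byte_16(const unsigned char *p)
{
    __m128i zero = _mm_setzero_si128();                  /* pxor     xmm1, xmm1 */
    __m128i v    = _mm_loadu_si128((const __m128i *)p);  /* movdqu   xmm0, [p]  */
    __m128i eq   = _mm_cmpeq_epi8(v, zero);              /* pcmpeqb  xmm0, xmm1 */
    int mask     = _mm_movemask_epi8(eq);                /* pmovmskb eax, xmm0  */
    return mask != 0;                                    /* test eax, eax + jcc */
}
```

The mask has one bit per byte, so the same value can also be fed to a bit-scan to locate the first zero byte once the check succeeds.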
It's break-even when you can use ptest directly, without needing a pcmp first, e.g. when testing any / all bits in the whole vector (or, with a mask, some bits). It's actually a win if you use the result for cmov or setcc instead of a branch. In the break-even case it's also a win for code size, even though it's the same number of uops.
You can amortize the checking over multiple vectors, e.g. por some vectors together and then check that all of the bytes are zero. Or pminub some vectors together and then check for any zeros. (glibc string functions like strlen and strchr use this trick to check a whole cache line of vectors in parallel, before sorting out where the hit came from after leaving the loop.)
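The amortization idea above can be sketched in C with SSE2 intrinsics. The two function names are hypothetical, and the fixed 64-byte block size is an assumption chosen to match one cache line; this is not glibc's actual code, just the same trick in miniature.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 if any of the 64 bytes at p is zero. */
int has_zero_byte_64(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p + 0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));

    /* pminub: a result byte is 0 iff that byte is 0 in at least one input */
    __m128i m = _mm_min_epu8(_mm_min_epu8(v0, v1), _mm_min_epu8(v2, v3));

    /* One pcmpeqb + pmovmskb for all four vectors */
    __m128i eq = _mm_cmpeq_epi8(m, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) != 0;
}

/* Hypothetical helper: 1 if all 64 bytes at p are zero. */
int is_all_zero_64(const unsigned char *p)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)(p + 0));
    __m128i v1 = _mm_loadu_si128((const __m128i *)(p + 16));
    __m128i v2 = _mm_loadu_si128((const __m128i *)(p + 32));
    __m128i v3 = _mm_loadu_si128((const __m128i *)(p + 48));

    /* por: a result byte is 0 iff that byte is 0 in every input */
    __m128i any = _mm_or_si128(_mm_or_si128(v0, v1), _mm_or_si128(v2, v3));

    /* All 16 mask bits set means every combined byte compared equal to 0 */
    __m128i eq = _mm_cmpeq_epi8(any, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Either way there is one compare, one pmovmskb and one branch per 64 bytes; finding which vector and which byte triggered the hit is left for after the loop exits.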
You can combine pcmpeq results instead of raw inputs, e.g. for memchr. In that case you can use pand instead of pminub to get a zero in an element where any input has a zero. Some CPUs run pand on more ports than pminub, so there is less competition for vector ALU ports.
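As a rough illustration of combining compare results rather than raw inputs, the sketch below ANDs two pcmpeqb results so that a byte of the combined vector is zero wherever either compare failed. The function all_bytes_equal_32 and its 32-byte block size are invented for this example; it is one possible use of the pand combiner, not code from any particular library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 1 if all 32 bytes at p are equal to c. */
int all_bytes_equal_32(const unsigned char *p, unsigned char c)
{
    __m128i needle = _mm_set1_epi8((char)c);  /* broadcast c to 16 bytes */

    /* pcmpeqb: each result byte is 0xFF on a match, 0x00 otherwise */
    __m128i eq0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)p), needle);
    __m128i eq1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(p + 16)), needle);

    /* pand: a combined byte is 0x00 if either compare produced 0x00 there */
    __m128i both = _mm_and_si128(eq0, eq1);

    /* A single pmovmskb covers both vectors */
    return _mm_movemask_epi8(both) == 0xFFFF;
}
```

For a memchr-style "is there any match" check, where a set byte marks a hit, the same shape works with OR (_mm_or_si128, i.e. por) as the combiner and a test for a nonzero mask.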
Also note that pmovmskb zero-extends into EAX, so you can use test eax,eax instead of wasting an operand-size prefix byte to test only AX.
Answer 2:
Use ptest:
ptest xmm0, xmm0
jz .zero
ptest a, b sets ZF if a ∧ b is zero and CF if ¬ a ∧ b is zero (Intel syntax, with a as the first operand).
Note, however, that SSE4.1 is required for ptest to be present.
Otherwise, I suppose your approach is as good as it gets.
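To make the flag rule for ptest concrete, here is a small scalar model in plain C. It is only an illustration: the u128 struct and the ptest_zf / ptest_cf names are invented, and the CF rule follows the definition in Intel's instruction reference, where CF is set when (NOT first operand) AND second operand is all zero.

```c
#include <stdint.h>

/* Illustrative 128-bit value as two 64-bit halves (not a real SIMD type). */
typedef struct { uint64_t lo, hi; } u128;

/* ZF after "ptest a, b": set when (a AND b) is all zero. */
int ptest_zf(u128 a, u128 b)
{
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF after "ptest a, b": set when ((NOT a) AND b) is all zero,
   with a being the first operand in Intel syntax. */
int ptest_cf(u128 a, u128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}
```

With both operands equal, as in ptest xmm0, xmm0, ZF simply reports whether the register is all zero and CF is always set, which is why jz is the branch to use in that case.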
Source: https://stackoverflow.com/questions/60446759/sse2-test-xmm-bitmask-directly-without-using-pmovmskb