For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast.
Currently, I'm doing this via M
SSE2 has direct support for some 64-bit integer operations:
Set both elements to 0:
__m128i z = _mm_setzero_si128();
Set both elements to 1:
__m128i z = _mm_set1_epi64x(1); // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo); // elements can be different
__m128i z = _mm_set_epi32(0,1,0,1); // if any compilers refuse int64_t in 32-bit mode. (None of the major ones do.)
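Since the argument order can be confusing (_mm_set_epi64x takes the high element first), here's a minimal sketch that stores a vector back to memory to show the layout; the variable names are just for illustration:
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>
int main(void) {
    __m128i v = _mm_set_epi64x(2, 1);    // high element = 2, low element = 1
    uint64_t out[2];
    _mm_storeu_si128((__m128i*)out, v);  // out[0] receives the low element
    printf("lo=%llu hi=%llu\n", (unsigned long long)out[0], (unsigned long long)out[1]);
    return 0;
}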
Set/load the low 64 bits, zero-extending to __m128i:
// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p); // movq or movsd 64-bit load
_mm_cvtsi64x_si128(a); // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a); // portable for a value instead of pointer
Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and it should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer.
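As a sketch of how you might wrap this for a 64-bit load/store round trip in 32-bit mode (the function name is just illustrative; _mm_storel_epi64 is the matching movq store):
#include <emmintrin.h>
#include <stdint.h>
uint64_t load64(const uint64_t* p) {
    __m128i v = _mm_loadl_epi64((const __m128i*)p); // single movq load of all 64 bits
    uint64_t out;
    _mm_storel_epi64((__m128i*)&out, v);            // single movq store
    return out;
}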
Vertically add/subtract each 64-bit integer:
__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_arithmetic.htm#intref_sse2_integer_arithmetic
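"Vertical" here means each 64-bit lane is handled independently, with wraparound modulo 2^64; a quick sketch:
__m128i x = _mm_set_epi64x(400, 3);
__m128i y = _mm_set_epi64x(100, 4);
__m128i z = _mm_add_epi64(x, y); // high lane = 500, low lane = 7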
Left shift:
__m128i z = _mm_slli_epi64(x,i) // i must be an immediate
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm
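If the count is a runtime variable rather than an immediate, the _mm_sll_epi64 / _mm_srl_epi64 forms take the count in the low 64 bits of another vector; a sketch:
int n = 5; // runtime shift count
__m128i lshift = _mm_sll_epi64(x, _mm_cvtsi32_si128(n)); // both lanes shifted left by n
__m128i rshift = _mm_srl_epi64(x, _mm_cvtsi32_si128(n)); // logical right shift; SSE2 has no 64-bit arithmetic right shift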
Bitwise operators:
__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm
SSE doesn't have an increment, so you'll have to add a constant of 1, as sketched below.
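A minimal sketch of the increment idiom:
const __m128i one = _mm_set1_epi64x(1);
x = _mm_add_epi64(x, one); // increments both 64-bit elements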
Comparisons are harder, since there's no 64-bit support until SSE4.1's pcmpeqq and SSE4.2's pcmpgtq.
Here's the one for equality:
__m128i t = _mm_cmpeq_epi32(a,b); // compare the 32-bit halves
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177)); // 177 = 0xB1 swaps the halves within each 64-bit lane, so both halves must match
This will set each 64-bit element to 0xffffffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out with _mm_cvtsi128_si32() and mask with & 1. (But sometimes you can do total -= cmp_result; instead of converting and masking, as sketched below.)
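For example, counting matching pairs across two arrays can stay entirely in vectors until the end (a sketch; a, b, and n are illustrative, assuming uint64_t arrays with n even):
__m128i counts = _mm_setzero_si128();
for (size_t i = 0; i < n; i += 2) {
    __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);
    __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);
    __m128i t  = _mm_cmpeq_epi32(va, vb);
    __m128i eq = _mm_and_si128(t, _mm_shuffle_epi32(t, 177)); // -1 in each lane where equal
    counts = _mm_sub_epi64(counts, eq); // subtracting -1 adds 1 per match
}
// the two 64-bit lanes of counts then need one final horizontal add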
And less-than (not fully tested):
a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000)); // flip sign bits so the signed
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000)); // 32-bit compares act as unsigned
__m128i t = _mm_cmplt_epi32(a,b); // a < b per 32-bit half
__m128i u = _mm_cmpgt_epi32(a,b); // a > b per 32-bit half
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177)); // either half less...
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z); // ...and the high half not greater (245 = 0xF5 broadcasts the high half)
This will set each 64-bit element to 0xffffffffffffffff if the corresponding element in a is less than the one in b, comparing them as unsigned 64-bit integers (that's what the sign-bit XOR buys: SSE2 only has signed 32-bit compares).
Here are versions of "equals" and "less-than" that return a bool; each returns the result of the comparison for the bottom 64-bit integer.
inline bool equals(__m128i a, __m128i b){
    __m128i t = _mm_cmpeq_epi32(a, b);
    __m128i z = _mm_and_si128(t, _mm_shuffle_epi32(t, 177)); // both 32-bit halves equal
    return _mm_cvtsi128_si32(z) & 1; // low bit of the low lane's mask
}
inline bool lessthan(__m128i a, __m128i b){
    // flip sign bits so the signed 32-bit compares act as unsigned
    a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z); // clear lanes where the high half is greater
    return _mm_cvtsi128_si32(z) & 1;
}
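A quick sanity check of the helpers (a sketch, with equals and lessthan defined as above; expected output is 1 0 1):
#include <emmintrin.h>
#include <stdio.h>
int main(void) {
    __m128i a = _mm_set_epi64x(0, 5);
    __m128i b = _mm_set_epi64x(0, 7);
    printf("%d %d %d\n", equals(a, a), equals(a, b), lessthan(a, b));
    return 0;
}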