Question
We use inline assembly to make SHA instructions available if __SHA__ is not defined. Under GCC we use:
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz" (c));
return a;
}
Clang does not support GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727). We need that constraint because the sha256rnds2 instruction reads one operand implicitly from xmm0:
Yz First SSE register (%xmm0).
We added a mov for Clang:
asm ("mov %2, %%xmm0; sha256rnds2 %%xmm0, %1, %0" : "+x"(a) : "xm"(b), "x" (c) : "xmm0");
Performance is off by about 3 cycles per byte. On my 2.2 GHz Celeron J3455 test machine (Goldmont with SHA extensions), that's about 230 MiB/s. It's non-trivial.
Looking at the disassembly, Clang is not optimizing around SHA's k when two rounds are performed:
Breakpoint 2, SHA256_SSE_SHA_HashBlocks (state=0xaaa3a0,
data=0xaaa340, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cdd0 <+0>: sub $0x308,%rsp
0x000000000068cdd7 <+7>: movdqu (%rdi),%xmm0
0x000000000068cddb <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068ce49 <+121>: movq %xmm2,%xmm0
0x000000000068ce4d <+125>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm1
0x000000000068ce56 <+134>: pshufd $0xe,%xmm2,%xmm3
0x000000000068ce5b <+139>: movdqa %xmm13,%xmm2
0x000000000068ce60 <+144>: movaps %xmm1,0x2e0(%rsp)
0x000000000068ce68 <+152>: movq %xmm3,%xmm0
0x000000000068ce6c <+156>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm2
0x000000000068ce75 <+165>: movdqu 0x10(%rsi),%xmm3
0x000000000068ce7a <+170>: pshufb %xmm8,%xmm3
0x000000000068ce80 <+176>: movaps %xmm2,0x2d0(%rsp)
0x000000000068ce88 <+184>: movdqa %xmm3,%xmm4
0x000000000068ce8c <+188>: paddd 0x6729c(%rip),%xmm4 # 0x6f4130
0x000000000068ce94 <+196>: movq %xmm4,%xmm0
0x000000000068ce98 <+200>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
...
For example, 0068ce8c through 0068ce98 should have been:
paddd 0x6729c(%rip),%xmm0 # 0x6f4130
sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
I'm guessing our choice of inline asm instructions is a bit off.
How do we work around the lack of the Yz machine constraint under Clang? What pattern avoids the intermediate move in optimized code?
Attempting to use an Explicit Register Variable:
const __m128i k asm("xmm0") = c;
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
return a;
Results in:
In file included from sha.cpp:24:
./cpu.h:831:22: warning: ignored asm label 'xmm0' on automatic variable
const __m128i k asm("xmm0") = c;
^
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm1, 752(%rsp), %xmm0
^~~~~~~~~~
In file included from sha.cpp:24:
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm3, 736(%rsp), %xmm1
^~~~~~~~~~
...
Answer 1:
I created this answer based on the inline assembly tag, with no specific language mentioned. Extended assembly templates already assume the use of language extensions.
If the Yz constraint isn't available, you can create a temporary variable that tells Clang which register to use, rather than relying on a constraint. You can do this through what is called an Explicit Register Variable:
You can define a local register variable and associate it with a specified register like this:
register int *foo asm ("r12");
Here r12 is the name of the register that should be used. Note that this is the same syntax used for defining global register variables, but for a local variable the declaration appears within a function. The register keyword is required, and cannot be combined with static. The register name must be a valid register name for the target platform.
In your case you wish to force usage of the xmm0 register. You could assign the c parameter to a temporary variable using an explicit register and use that temporary as a parameter to the Extended Inline Assembly. This is the primary purpose of explicit registers in GCC/Clang.
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
register const __m128i tmpc asm("xmm0") = c;
__asm__("sha256rnds2 %2, %1, %0" : "+x"(a) : "x"(b), "x" (tmpc));
return a;
}
The compiler should be able to provide some optimizations now, since it has more knowledge of how the xmm0 register is to be used.
When you placed mov %2, %%xmm0; into the template, Clang (and GCC) could not optimize those instructions. To the compiler, Basic Assembly and Extended Assembly templates are a black box: it only performs textual operand substitution based on the constraints.
Here's a disassembly using the method above. It was compiled with clang++ and -std=c++03. The extra moves are no longer present:
Breakpoint 1, SHA256_SSE_SHA_HashBlocks (state=0x7fffffffae60,
data=0x7fffffffae00, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cf60 <+0>: sub $0x308,%rsp
0x000000000068cf67 <+7>: movdqu (%rdi),%xmm0
0x000000000068cf6b <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068cfe6 <+134>: paddd 0x670e2(%rip),%xmm0 # 0x6f40d0
0x000000000068cfee <+142>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm2
0x000000000068cff7 <+151>: pshufd $0xe,%xmm0,%xmm1
0x000000000068cffc <+156>: movdqa %xmm1,%xmm0
0x000000000068d000 <+160>: movaps %xmm2,0x2e0(%rsp)
0x000000000068d008 <+168>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm3
0x000000000068d011 <+177>: movdqu 0x10(%rsi),%xmm5
0x000000000068d016 <+182>: pshufb %xmm9,%xmm5
0x000000000068d01c <+188>: movaps %xmm3,0x2d0(%rsp)
0x000000000068d024 <+196>: movdqa %xmm5,%xmm0
0x000000000068d028 <+200>: paddd 0x670b0(%rip),%xmm0 # 0x6f40e0
0x000000000068d030 <+208>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm2
...
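The explicit-register pattern can also be tried on a machine without SHA extensions. The following is only a sketch under stated assumptions: paddd stands in for sha256rnds2, the helper names (add_epi32_k_in_xmm0, add4, self_check) are made up for illustration, and the template names %xmm0 literally to mimic an instruction that reads xmm0 implicitly. It assumes GCC or Clang targeting x86 with SSE2, and falls back to plain C elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <emmintrin.h>

/* Hypothetical stand-in for the SHA wrapper: adds four 32-bit lanes,
   with the second operand forced into xmm0. */
static inline __m128i add_epi32_k_in_xmm0(__m128i a, __m128i c)
{
    /* The register keyword is required; without it Clang warns
       "ignored asm label" and picks an arbitrary register. */
    register __m128i k __asm__("xmm0") = c;

    /* The template hardcodes %xmm0, much as sha256rnds2 reads it
       implicitly. Listing k as an input tells the compiler that xmm0
       must hold c when the instruction executes. */
    __asm__("paddd %%xmm0, %0" : "+x"(a) : "x"(k));
    return a;
}

static void add4(const uint32_t a[4], const uint32_t c[4], uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    _mm_storeu_si128((__m128i *)out, add_epi32_k_in_xmm0(va, vc));
}
#else
/* Portable fallback so the sketch still builds on other targets. */
static void add4(const uint32_t a[4], const uint32_t c[4], uint32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + c[i];
}
#endif

/* Checks lane-wise addition, including 32-bit wraparound. */
static void self_check(void)
{
    const uint32_t a[4] = {1, 2, 3, 0xFFFFFFFFu};
    const uint32_t c[4] = {10, 20, 30, 1};
    uint32_t out[4];

    add4(a, c, out);
    assert(out[0] == 11 && out[1] == 22 && out[2] == 33 && out[3] == 0);
}
```

If the compiler did not honor the register pin, the hardcoded %xmm0 would hold an unrelated value and self_check would fail, which makes this a quick way to test a given compiler before relying on the pattern for sha256rnds2.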
Source: https://stackoverflow.com/questions/43544072/work-around-lack-of-yz-machine-constraint-under-clang