Question
This is my first question here...
I'm writing an arbitrary precision integer class to be used in C# (64-bit). Currently I'm working on the multiplication routine, using a recursive divide-and-conquer algorithm to break down the multi-bit multiplication into a series of primitive 64-to-128-bit multiplications, the results of which are then recombined by simple addition. In order to get a significant performance boost, I'm writing the code in native x64 C++, embedded in a C++/CLI wrapper to make it callable from C# code.
It all works great so far, regarding the algorithms. However, my problem is the optimization for speed. Since the 64-to-128-bit multiplication is the real bottleneck here, I tried to optimize my code right there. My first simple approach was a C++ function that implements this multiplication by performing four 32-to-64-bit multiplications and recombining the results with a couple of shifts and adds. This is the source code:
// 64-bit to 128-bit multiplication, using the following decomposition:
// (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij
public: static void Mul (UINT64  u8Factor1,
                         UINT64  u8Factor2,
                         UINT64& u8ProductL,
                         UINT64& u8ProductH)
{
    UINT64 u8Result1, u8Result2;
    // split both factors into 32-bit halves
    UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
    UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
    UINT64 u8Factor1H = u8Factor1 >> 32;
    UINT64 u8Factor2H = u8Factor2 >> 32;
    // four 32-to-64-bit partial products
    u8ProductL = u8Factor1L * u8Factor2L;
    u8ProductH = u8Factor1H * u8Factor2H;
    u8Result1 = u8Factor1L * u8Factor2H;
    u8Result2 = u8Factor1H * u8Factor2L;
    // sum of the two middle products; detect overflow before adding
    if (u8Result1 > MAX_UINT64 - u8Result2)
    {
        u8Result1 += u8Result2;
        u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
    }
    else
    {
        u8Result1 += u8Result2;
        u8Result2 = (u8Result1 >> 32);
    }
    // lower half of the middle sum goes into ProductL; propagate its carry
    if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
    {
        u8Result2++;
    }
    u8ProductL += u8Result1;
    u8ProductH += u8Result2;
    return;
}
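Why Mul needs the two carry checks can be seen from a bound on the middle term. The following derivation is an added note rather than part of the original post; it writes the factors as x = a*2^32 + i and y = b*2^32 + j, with all four halves below 2^32.

```latex
\begin{align*}
x\,y &= ab\,2^{64} + (aj + bi)\,2^{32} + ij,\\
aj,\ bi &\le (2^{32}-1)^2 = 2^{64} - 2^{33} + 1,\\
aj + bi &\le 2^{65} - 2^{34} + 2 \;>\; 2^{64} - 1.
\end{align*}
```

So the sum aj + bi can need 65 bits. The bit that falls off has weight 2^64 within the sum, hence 2^96 in the product, which is bit 32 of the high word; that is what the `| 0x100000000ULL` restores. A second, independent carry can occur when the lower half of the sum, shifted left by 32, is added to ij, and that is the `u8Result2++` branch.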
Mul expects two 64-bit values and returns the 128-bit result as two 64-bit quantities passed by reference. This works fine. In the next step, I tried to replace the call to this function with ASM code that uses the CPU's MUL instruction. Since the Microsoft x64 compiler no longer supports inline ASM, the code must be put into a separate .asm file. This is the implementation:
_TEXT segment
; =============================================================================
; multiplication
; -----------------------------------------------------------------------------
; 64-bit to 128-bit multiplication, using the x64 MUL instruction
AsmMul1 proc ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z
; rcx : Factor1
; rdx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mul rdx ; rdx:rax = Factor1 * Factor2
mov qword ptr [r8], rax ; [r8] = ProductL
mov qword ptr [r9], rdx ; [r9] = ProductH
ret
AsmMul1 endp
; =============================================================================
_TEXT ends
end
That's as simple and straightforward as it gets. The function is referenced from C++ code using an extern "C" forward declaration:
extern "C"
{
    void AsmMul1 (UINT64, UINT64, UINT64&, UINT64&);
}
To my surprise, it turned out to be significantly slower than the C++ function. To properly benchmark the performance, I've written a C++ function that computes 10,000,000 pairs of pseudo-random unsigned 64-bit values and performs multiplications in a tight loop, using those implementations one after another, with exactly the same values. The code is compiled in Release mode with optimizations turned on. The time spent in the loop is 515 msec for the ASM version, compared to 125 msec (!) for the C++ version.
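A benchmark of the kind described above could be sketched as follows. This is not the original benchmark code, which was not posted: the names `MulFn`, `MulPortable`, `MakeOperands` and `RunBenchmark`, the choice of `std::mt19937_64` as the pseudo-random generator, and the XOR checksum are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Signature shared by all multiplication variants under test.
using MulFn = void (*)(std::uint64_t, std::uint64_t, std::uint64_t&, std::uint64_t&);
using Operands = std::vector<std::pair<std::uint64_t, std::uint64_t>>;

// Hypothetical stand-in for one variant: four 32-bit partial products.
void MulPortable(std::uint64_t x, std::uint64_t y, std::uint64_t& lo, std::uint64_t& hi)
{
    const std::uint64_t xl = x & 0xFFFFFFFFULL, xh = x >> 32;
    const std::uint64_t yl = y & 0xFFFFFFFFULL, yh = y >> 32;
    const std::uint64_t ll = xl * yl, lh = xl * yh, hl = xh * yl, hh = xh * yh;
    // Middle column: a sum of three values below 2^32, so it cannot overflow.
    const std::uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFULL) + (hl & 0xFFFFFFFFULL);
    lo = (mid << 32) | (ll & 0xFFFFFFFFULL);
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

struct BenchResult {
    double        millis;    // wall-clock time spent in the loop
    std::uint64_t checksum;  // XOR of all result words, so the loop has an observable result
};

// Pre-generate the operands so that every variant sees exactly the same values.
Operands MakeOperands(std::size_t count, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    Operands ops;
    ops.reserve(count);
    for (std::size_t n = 0; n < count; ++n) {
        const std::uint64_t a = rng();
        const std::uint64_t b = rng();
        ops.emplace_back(a, b);
    }
    return ops;
}

// Time one variant over all operand pairs.
BenchResult RunBenchmark(MulFn mul, const Operands& ops)
{
    std::uint64_t checksum = 0, lo = 0, hi = 0;
    const auto start = std::chrono::steady_clock::now();
    for (const auto& op : ops) {
        mul(op.first, op.second, lo, hi);
        checksum ^= lo ^ hi;
    }
    const auto stop = std::chrono::steady_clock::now();
    return { std::chrono::duration<double, std::milli>(stop - start).count(), checksum };
}
```

A run over 10,000,000 pairs would then be `RunBenchmark(MulPortable, MakeOperands(10000000, seed))` with any fixed seed. Comparing the checksums of two variants is a cheap way to confirm that they compute the same products before comparing their times.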
That's quite strange. So I opened the disassembly window in the debugger and copied the ASM code generated by the compiler. This is what I found there, slightly edited for readability and for use with MASM:
AsmMul3 proc ; ?AsmMul3@@$$FYAX_K0AEA_K1@Z
; rcx : Factor1
; rdx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov eax, 0FFFFFFFFh
and rax, rcx
; UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
mov r10d, 0FFFFFFFFh
and r10, rdx
; UINT64 u8Factor1H = u8Factor1 >> 32;
shr rcx, 20h
; UINT64 u8Factor2H = u8Factor2 >> 32;
shr rdx, 20h
; u8ProductL = u8Factor1L * u8Factor2L;
mov r11, r10
imul r11, rax
mov qword ptr [r8], r11
; u8ProductH = u8Factor1H * u8Factor2H;
mov r11, rdx
imul r11, rcx
mov qword ptr [r9], r11
; u8Result1 = u8Factor1L * u8Factor2H;
imul rax, rdx
; u8Result2 = u8Factor1H * u8Factor2L;
mov rdx, rcx
imul rdx, r10
; if (u8Result1 > MAX_UINT64 - u8Result2)
mov rcx, rdx
neg rcx
dec rcx
cmp rcx, rax
jae label1
; u8Result1 += u8Result2;
add rax, rdx
; u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
mov rdx, rax
shr rdx, 20h
mov rcx, 100000000h
or rcx, rdx
jmp label2
; u8Result1 += u8Result2;
label1:
add rax, rdx
; u8Result2 = (u8Result1 >> 32);
mov rcx, rax
shr rcx, 20h
; if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
label2:
shl rax, 20h
mov rdx, qword ptr [r8]
mov r10, rax
neg r10
dec r10
cmp r10, rdx
jae label3
; u8Result2++;
inc rcx
; u8ProductL += u8Result1;
label3:
add rdx, rax
mov qword ptr [r8], rdx
; u8ProductH += u8Result2;
add qword ptr [r9], rcx
ret
AsmMul3 endp
Copying this code into my MASM source file and calling it from my benchmark routine resulted in 547 msec spent in the loop. That's slightly slower than AsmMul1, and considerably slower than the C++ function. That's even stranger, since AsmMul3 and the C++ function are supposed to execute exactly the same machine code.
So I tried another variant, this time using hand-optimized ASM code that does exactly the same four 32-to-64-bit multiplications, but in a more straightforward way. The code should avoid jumps and immediate values, make use of the CPU FLAGS for carry evaluation, and use interleaving of instructions in order to avoid register stalls. This is what I came up with:
; 64-bit to 128-bit multiplication, using the following decomposition:
; (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij
AsmMul2 proc ; ?AsmMul2@@$$FYAX_K0AEA_K1@Z
; rcx : Factor1
; rdx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mov r11, rdx ; r11 = Factor2
shr rax, 32 ; rax = Factor1H
shr r11, 32 ; r11 = Factor2H
and ecx, ecx ; rcx = Factor1L
mov r10d, eax ; r10 = Factor1H
and edx, edx ; rdx = Factor2L
imul rax, r11 ; rax = ab = Factor1H * Factor2H
imul r10, rdx ; r10 = aj = Factor1H * Factor2L
imul r11, rcx ; r11 = bi = Factor1L * Factor2H
imul rdx, rcx ; rdx = ij = Factor1L * Factor2L
xor ecx, ecx ; rcx = 0
add r10, r11 ; r10 = aj + bi
adc ecx, ecx ; rcx = carry (aj + bi)
mov r11, r10 ; r11 = aj + bi
shl rcx, 32 ; rcx = carry (aj + bi) << 32
shl r10, 32 ; r10 = lower (aj + bi) << 32
shr r11, 32 ; r11 = upper (aj + bi) >> 32
add rdx, r10 ; rdx = ij + (lower (aj + bi) << 32)
adc rax, r11 ; rax = ab + (upper (aj + bi) >> 32)
mov qword ptr [r8], rdx ; save ProductL
add rax, rcx ; add carry (aj + bi) << 32
mov qword ptr [r9], rax ; save ProductH
ret
AsmMul2 endp
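For comparison, the same carry handling can be written in C++ without any if/else. This is only a sketch that mirrors the steps of AsmMul2; the names `U128` and `MulNoBranch` are made up here, and whether a compiler really emits flag-based code (SETC/ADC) for the comparisons is not guaranteed. The function is constexpr (C++14 or later) so that the checks run at compile time.

```cpp
#include <cstdint>

struct U128 { std::uint64_t lo; std::uint64_t hi; };

// 64x64 -> 128 multiply: (a*2^32 + i)(b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij
constexpr U128 MulNoBranch(std::uint64_t x, std::uint64_t y)
{
    const std::uint64_t i = x & 0xFFFFFFFFULL, a = x >> 32;
    const std::uint64_t j = y & 0xFFFFFFFFULL, b = y >> 32;

    const std::uint64_t ab = a * b, aj = a * j, bi = b * i, ij = i * j;

    // aj + bi may wrap around; an unsigned sum wrapped exactly when it is
    // smaller than one of its operands (the role of "adc ecx, ecx").
    const std::uint64_t mid      = aj + bi;
    const std::uint64_t midCarry = static_cast<std::uint64_t>(mid < aj);

    // Low word: ij plus the lower half of the middle sum, with its own carry.
    const std::uint64_t lo      = ij + (mid << 32);
    const std::uint64_t loCarry = static_cast<std::uint64_t>(lo < ij);

    // High word: ab plus the upper half of the middle sum plus both carries.
    const std::uint64_t hi = ab + (mid >> 32) + loCarry + (midCarry << 32);
    return U128{ lo, hi };
}

// (2^64 - 1)^2 = 2^128 - 2^65 + 1, which exercises both carries.
static_assert(MulNoBranch(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL).lo == 1ULL, "low word");
static_assert(MulNoBranch(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL).hi == 0xFFFFFFFFFFFFFFFEULL, "high word");
```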
The benchmark for AsmMul2 yielded 500 msec, so this seems to be the fastest of the three ASM implementations. However, the performance differences among them are quite marginal, and all of them are about four times slower than the naive C++ approach!
So what's going on here? It seems to me that there's some general performance penalty for calling ASM code from C++, but I can't find anything on the internet that might explain it. The way I'm interfacing ASM is exactly how Microsoft recommends it.
But now, watch out for another, still stranger thing! Well, there are compiler intrinsics, aren't there? The _umul128 intrinsic supposedly should do exactly what my AsmMul1 function does, i.e. call the 64-bit CPU MUL instruction. So I replaced the AsmMul1 call by a corresponding call to _umul128. Now see what performance values I've got in return (again, I'm running all four benchmarks sequentially in a single function):
_umul128: 109 msec
AsmMul2: 94 msec (hand-optimized ASM)
AsmMul3: 125 msec (compiler-generated ASM)
C++ function: 828 msec
Now the ASM versions are blazingly fast, with about the same relative differences as before. However, the C++ function is terribly lazy now! Somehow the use of an intrinsic turns the entire performance values upside down. Scary...
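For reference, a call to the _umul128 intrinsic could be wrapped as shown below so that it has the same shape as Mul and AsmMul1. The wrapper name `MulIntrinsic` is hypothetical, and the `unsigned __int128` branch is an assumption added only so that the sketch also builds with GCC or Clang; the original code targets MSVC only.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   // _umul128 (MSVC, x64 targets only)
#endif

// Hypothetical wrapper: 64x64 -> 128 multiply with the result returned by reference.
inline void MulIntrinsic(std::uint64_t factor1, std::uint64_t factor2,
                         std::uint64_t& productL, std::uint64_t& productH)
{
#if defined(_MSC_VER) && defined(_M_X64)
    // _umul128 returns the low 64 bits and stores the high 64 bits through the pointer.
    unsigned __int64 high = 0;
    productL = _umul128(factor1, factor2, &high);
    productH = high;
#elif defined(__SIZEOF_INT128__)
    // Non-MSVC fallback (assumption): let the compiler's 128-bit integer type do the work.
    const unsigned __int128 product = static_cast<unsigned __int128>(factor1) * factor2;
    productL = static_cast<std::uint64_t>(product);
    productH = static_cast<std::uint64_t>(product >> 64);
#else
#error "This sketch needs either _umul128 or unsigned __int128"
#endif
}
```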
I haven't got any explanation for this strange behavior, and would be thankful at least for any hints about what's going on here. It would be even better if someone could explain how to get these performance issues under control. Currently I'm quite worried, because obviously a small change in the code can have huge performance impacts. I would like to understand the mechanisms underlying here, and how to get reliable results.
And another thing: Why is the 64-to-128-bit MUL slower than four 64-to-64-bit IMULs?!
Thanks in advance!
Answer 1:
After a lot of trial-and-error, and additional extensive research on the Internet, it seems I've found the reason for this strange performance behavior. The magic word is thunking of function entry points. But let me start from the beginning.
One observation I made is that it doesn't really matter which compiler intrinsic is used in order to turn my benchmark results upside down. Actually, it suffices to put a __nop() (CPU NOP opcode) anywhere inside a function to trigger this effect. It works even if it's placed right before the return. More tests have shown that the effect is restricted to the function that contains the intrinsic. The __nop() does nothing with respect to the code flow, but obviously it changes the properties of the containing function.
I've found a question on Stack Overflow that seems to tackle a similar problem: "How to best avoid double thunking in C++/CLI native types". In the comments, the following additional information is found:
One of my own classes in our base library - which uses MFC - is called about a million times. We are seeing massive sporadic performance issues, and firing up the profiler I can see a thunk right at the bottom of this chain. That thunk takes longer than the method call.
That's exactly what I'm observing as well: "something" on the way into the function call is taking about four times longer than my code. Function thunks are explained to some extent in the documentation of the __clrcall modifier and in an article about Double Thunking. The former contains a hint about a side effect of using intrinsics:
You can directly call __clrcall functions from existing C++ code that was compiled by using /clr as long as that function has an MSIL implementation. __clrcall functions cannot be called directly from functions that have inline asm and call CPU-specific intrinisics, for example, even if those functions are compiled with /clr.
So, as far as I understand it, a function that contains intrinsics loses its __clrcall modifier, which is added automatically when the /clr compiler switch is specified - which is usually the case if the C++ functions should be compiled to native code.
I don't get all of the details of this thunking and double thunking stuff, but obviously it is required to make unmanaged functions callable from managed functions. However, it is possible to switch it off per function by enclosing the function in a #pragma managed(push, off) / #pragma managed(pop) pair. Unfortunately, this #pragma doesn't work inside namespace blocks, so some editing might be required to place it everywhere it is needed.
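A minimal sketch of such a region is shown below. The helper names `NativeMul` and `NativeMulLimbs` are hypothetical and not from the original code. The `_MSC_VER` guard is only there so that the sketch also compiles with other compilers, and the idea that keeping caller and callee inside the same region avoids transitions between them is an assumption based on the observations in this answer, not something measured here.

```cpp
#include <cstddef>
#include <cstdint>

#ifdef _MSC_VER
#pragma managed(push, off)   // from here on: native code, even in a /clr build
#endif

// Hypothetical native primitive: 64x64 -> 128 multiply via four 32-bit partial products.
static void NativeMul(std::uint64_t x, std::uint64_t y, std::uint64_t& lo, std::uint64_t& hi)
{
    const std::uint64_t xl = x & 0xFFFFFFFFULL, xh = x >> 32;
    const std::uint64_t yl = y & 0xFFFFFFFFULL, yh = y >> 32;
    const std::uint64_t ll = xl * yl, lh = xl * yh, hl = xh * yl, hh = xh * yh;
    const std::uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFULL) + (hl & 0xFFFFFFFFULL);
    lo = (mid << 32) | (ll & 0xFFFFFFFFULL);
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

// Hypothetical native caller: multiplies a little-endian limb array by one factor
// in place and returns the carry-out limb. It lives in the same native region
// as NativeMul, so the call in the loop is a plain native-to-native call.
std::uint64_t NativeMulLimbs(std::uint64_t* limbs, std::size_t count, std::uint64_t factor)
{
    std::uint64_t carry = 0;
    for (std::size_t n = 0; n < count; ++n) {
        std::uint64_t lo = 0, hi = 0;
        NativeMul(limbs[n], factor, lo, hi);
        lo += carry;                      // add the carry from the previous limb
        hi += (lo < carry) ? 1u : 0u;     // propagate the wrap-around of that addition
        limbs[n] = lo;
        carry = hi;
    }
    return carry;
}

#ifdef _MSC_VER
#pragma managed(pop)         // restore the previous managed/unmanaged setting
#endif
```

Managed code would then call NativeMulLimbs once per array rather than NativeMul once per limb, which keeps the number of managed-to-native transitions small.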
I've tried this trick, placing all of my native multi-precision code inside this #pragma, and got the following benchmark results:
AsmMul1: 78 msec (64-to-128-bit CPU MUL)
AsmMul2: 94 msec (hand-optimized ASM, 4 x IMUL)
AsmMul3: 125 msec (compiler-generated ASM, 4 x IMUL)
C++ function: 109 msec
Now this looks reasonable, finally! Now all versions have about the same execution times, which is what I would expect from an optimized C++ program. Alas, there's still no happy end... Placing the winner AsmMul1 into my multi-precision multiplier yielded twice the execution time of the version with the C++ function without #pragma. The explanation is, in my opinion, that this code makes calls to unmanaged functions in other classes, which are outside the #pragma and hence have a __clrcall modifier. This seems to create significant overhead again.
Frankly, I'm tired of investigating further into this issue. Although the ASM PROC with the single MUL instruction seems to beat all other attempts, the gain is not as big as expected, and getting the thunking out of the way leads to so many changes in my code that I don't think it's worth the hassle. So I'll go on with the C++ function I've written in the very beginning, originally destined to be just a placeholder for something better...
It seems to me that ASM interfacing in C++/CLI is not well supported, or maybe I'm still missing something basic here. Maybe there's a way to get this function thunking out of the way for just the ASM functions, but so far I haven't found a solution. Not even remotely.
Feel free to add your own thoughts and observations here - even if they are just speculative. I think it's still a highly interesting topic that needs much more investigation.
Source: https://stackoverflow.com/questions/55266411/calling-masm-proc-from-c-cli-in-x64-mode-yields-unexpected-performance-problem