Question
I'm profiling an application using Intel VTune, and there is one particular hotspot where I'm copying a __m128i member variable in the copy constructor of a C++ class.
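For concreteness, here is a minimal sketch of the kind of code being described. It is not the actual application code: the class name Particle, the member bits_, and the helper copy_all are all hypothetical, and the guess that the copies happen in a loop over contiguous objects is only an assumption, based on add $0x10 matching sizeof(__m128i).

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_set1_epi32, _mm_cvtsi128_si32
#include <vector>

// 0x10 in the disassembly is 16, the size of one __m128i.
static_assert(sizeof(__m128i) == 16, "__m128i is expected to be 16 bytes");

// Hypothetical stand-in for the real class; only the __m128i member matters.
class Particle {
public:
    explicit Particle(int seed) : bits_(_mm_set1_epi32(seed)) {}

    // Copy constructor: copies the 16-byte member, which a compiler will
    // typically lower to one aligned vector load and one aligned vector store.
    Particle(const Particle& other) : bits_(other.bits_) {}

    // Lowest 32-bit lane, exposed only so the copy can be checked.
    int low32() const { return _mm_cvtsi128_si32(bits_); }

private:
    __m128i bits_;
};

// Assumed usage pattern: copy-constructing a whole range of objects,
// which invokes the copy constructor once per element.
std::vector<Particle> copy_all(const std::vector<Particle>& src) {
    std::vector<Particle> dst(src.begin(), src.end());
    assert(dst.size() == src.size());
    return dst;
}
```

If the real code looks roughly like this, the load, the pointer increment, and the store would all belong to the same per-element copy.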
VTune gives this breakdown:
Instruction                     CPU Time: Total   CPU Time: Self
Block 1:
vmovdqa64x (%rax), %xmm0        4.1%              0.760s
add $0x10, %rax                 46.6%             8.594s
Block 2:
vmovapsx %xmm0, -10x(%rdx)      6.5%              1.204s
(If it matters, compiler is gcc 7.4.0)
I admit I'm an assembly noob, but it's very surprising that one particular add instruction is taking up 46% of my application time, given that the app is doing lots of other complex things and add is such a trivial operation.
Am I misinterpreting the profiling output somehow? Is there a path to optimize this other than "copy that variable less"?
Source: https://stackoverflow.com/questions/60232283/why-is-add-taking-so-long-in-my-application