Why is 'add' taking so long in my application?
Question

I'm profiling an application using Intel VTune, and there is one particular hotspot where I'm copying a __m128i member variable in the copy constructor of a C++ class. VTune gives this breakdown:

    Instruction                    CPU Time: Total    CPU Time: Self
    Block 1:
    vmovdqa64x (%rax), %xmm0       4.1%               0.760s
    add $0x10, %rax                46.6%              8.594s
    Block 2:
    vmovapsx %xmm0, -10x(%rdx)     6.5%               1.204s

(If it matters, the compiler is gcc 7.4.0.)

I admit I'm an assembly noob, but it's very surprising that one particular add instruction is