tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.
Let's start with MSVC x64:
Here's the critical loop:
$LL3@main:
movsxd rax, DWORD PTR [rdx-4]
movsxd rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main
What you see here is standard loop unrolling by the compiler. MSVC unrolls by 4 iterations and splits the var variable across four registers: r10, rbx, r9, and r8. Then at the end of the loop, these 4 partial sums are added back together.
Here's where the 4 sums are recombined:
lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main
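To make the transformation concrete, here is roughly the scalar C++ that MSVC's output corresponds to. This is a sketch only: the names arr, n, and the 64-bit var accumulator are my assumptions about the original source, not taken from it.

#include <cstddef>

// What MSVC's 4-way unrolling is morally equivalent to in C++.
long long sum_unrolled4(const int* arr, std::size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // r10, rbx, r9, r8
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += arr[i + 0];  // each movsxd sign-extends a 32-bit load
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    long long var = ((s2 + s3) + s0) + s1;     // the lea/add recombination
    for (; i < n; ++i)                         // leftover elements
        var += arr[i];
    return var;
}

The point of the four independent accumulators is to break the dependency chain: the four adds per iteration don't have to wait on each other.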
Note that MSVC currently does not do automatic vectorization.
Now let's look at part of your ICC output:
000000013F0510A2 movq xmm2,mmword ptr arr[rcx]
000000013F0510A8 add r8,8
000000013F0510AC punpckldq xmm2,xmm2
000000013F0510B0 add rcx,20h
000000013F0510B4 movdqa xmm3,xmm2
000000013F0510B8 pand xmm2,xmm0
000000013F0510BC movq xmm4,mmword ptr [rdx+8]
000000013F0510C1 psrad xmm3,1Fh
000000013F0510C6 punpckldq xmm4,xmm4
000000013F0510CA pand xmm3,xmm1
000000013F0510CE por xmm3,xmm2
000000013F0510D2 movdqa xmm5,xmm4
000000013F0510D6 movq xmm2,mmword ptr [rdx+10h]
000000013F0510DB psrad xmm5,1Fh
000000013F0510E0 punpckldq xmm2,xmm2
000000013F0510E4 pand xmm5,xmm1
000000013F0510E8 paddq xmm6,xmm3
...
What you're seeing here is ICC's attempt to vectorize the loop. It does so in a manner similar to MSVC (splitting the sum into multiple partial sums), but using SSE registers, with two sums per register. As it turns out, though, the overhead of the vectorization outweighs its benefits.
If we walk through these instructions one by one, we can see how ICC tries to vectorize it:
// Load two ints using a 64-bit load: {x, y, 0, 0}
movq xmm2,mmword ptr arr[rcx]

// Shuffle the data into this form:
punpckldq xmm2,xmm2    // xmm2 = {x, x, y, y}
movdqa xmm3,xmm2       // xmm3 = {x, x, y, y}

// Mask out indices 1 and 3.
pand xmm2,xmm0         // xmm2 = {x, 0, y, 0}

// Arithmetic right-shift to copy the sign bit across each word.
psrad xmm3,1Fh         // xmm3 = {sign(x), sign(x), sign(y), sign(y)}

// Mask out indices 0 and 2.
pand xmm3,xmm1         // xmm3 = {0, sign(x), 0, sign(y)}

// Combine to get the sign-extended values.
por xmm3,xmm2          // xmm3 = {x, sign(x), y, sign(y)}
                       //      = {x, y} as two 64-bit integers

// Add into the accumulator...
paddq xmm6,xmm3
So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64 bits using only SSE2-level instructions.
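For reference, here is the same sign-extension dance written out in SSE2 intrinsics. A sketch, not ICC's code: the function name and mask constants are mine.

#include <emmintrin.h>  // SSE2

// Sign-extend two 32-bit ints to 64-bit and add them to two running
// 64-bit sums, using only SSE2 -- the same trick as the assembly above.
__m128i add_two_ints_sse2(__m128i acc, const int* p) {
    __m128i v = _mm_loadl_epi64((const __m128i*)p);       // {x, y, 0, 0}
    v = _mm_unpacklo_epi32(v, v);                         // {x, x, y, y}
    __m128i sign = _mm_srai_epi32(v, 31);                 // sign-fill each lane
    __m128i lo = _mm_and_si128(v, _mm_set_epi32(0, -1, 0, -1));     // {x, 0, y, 0}
    __m128i hi = _mm_and_si128(sign, _mm_set_epi32(-1, 0, -1, 0));  // {0, sign(x), 0, sign(y)}
    __m128i wide = _mm_or_si128(lo, hi);                  // {x, y} as 64-bit ints
    return _mm_add_epi64(acc, wide);                      // two parallel sums
}

(In ICC's output the two masks sit in xmm0 and xmm1, loaded once outside the loop; a compiler hoists the _mm_set_epi32 constants the same way.)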
SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.
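With SSE4.1 available, the whole shuffle/shift/mask/or sequence collapses into a single sign-extending conversion. Again a sketch, not actual ICC output:

#include <smmintrin.h>  // SSE4.1

__m128i add_two_ints_sse41(__m128i acc, const int* p) {
    __m128i v = _mm_loadl_epi64((const __m128i*)p);  // {x, y, 0, 0}
    __m128i wide = _mm_cvtepi32_epi64(v);            // PMOVSXDQ: sign-extend to {x, y}
    return _mm_add_epi64(acc, wide);
}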
But the point is:
The Intel compiler is trying to vectorize the loop, but the overhead it adds outweighs the benefit of vectorizing it in the first place. That's why it's slower.
EDIT: Update with the OP's results on:
- ICC x64 no vectorization
- ICC x86 with vectorization
You changed the data type to double, so now it's floating-point. There are no more of those ugly sign-fill shifts that were plaguing the integer version.
But since you disabled vectorization for the x64 version, it obviously becomes slower.
ICC x86 with vectorization:
00B8109E addpd xmm0,xmmword ptr arr[edx*8]
00B810A4 addpd xmm1,xmmword ptr [esp+edx*8+40h]
00B810AA addpd xmm0,xmmword ptr [esp+edx*8+50h]
00B810B0 addpd xmm1,xmmword ptr [esp+edx*8+60h]
00B810B6 add edx,8
00B810B9 cmp edx,400h
00B810BF jb wmain+9Eh (0B8109Eh)
Not much here - standard vectorization + 4x loop-unrolling.
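In intrinsics form, that loop is roughly the following sketch. The 0x400 (1024) element count comes from the cmp above; the function wrapper and alignment assumption are mine (the aligned addpd memory operands imply a 16-byte-aligned arr).

#include <emmintrin.h>  // SSE2

double sum_vectorized(const double* arr) {
    // Two vector accumulators (xmm0/xmm1), each holding two partial
    // sums, with the vector loop unrolled 4x: 8 doubles per iteration.
    __m128d a0 = _mm_setzero_pd();
    __m128d a1 = _mm_setzero_pd();
    for (int i = 0; i < 0x400; i += 8) {
        a0 = _mm_add_pd(a0, _mm_load_pd(arr + i + 0));
        a1 = _mm_add_pd(a1, _mm_load_pd(arr + i + 2));
        a0 = _mm_add_pd(a0, _mm_load_pd(arr + i + 4));
        a1 = _mm_add_pd(a1, _mm_load_pd(arr + i + 6));
    }
    // Horizontal reduction of the four partial sums.
    __m128d t = _mm_add_pd(a0, a1);
    return _mm_cvtsd_f64(_mm_add_sd(t, _mm_unpackhi_pd(t, t)));
}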
ICC x64 with no vectorization:
000000013FC010B2 lea ecx,[rdx+rdx]
000000013FC010B5 inc edx
000000013FC010B7 cmp edx,200h
000000013FC010BD addsd xmm6,mmword ptr arr[rcx*8]
000000013FC010C3 addsd xmm6,mmword ptr [rsp+rcx*8+58h]
000000013FC010C9 jb wmain+0B2h (13FC010B2h)
No vectorization + only 2x loop-unrolling.
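Which corresponds to something like this (same assumed names as before):

double sum_scalar(const double* arr) {
    double sum = 0.0;  // everything funnels through xmm6
    for (int i = 0; i < 0x400; i += 2) {
        sum += arr[i];      // addsd
        sum += arr[i + 1];  // addsd
    }
    return sum;
}

Since both adds feed the single accumulator, each addsd has to wait for the previous one; the vectorized x86 version keeps four independent partial sums in flight.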
All else being equal, disabling vectorization hurts performance in this floating-point case.