tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.
Let's start with MSVC x64:
Here's the critical loop:
$LL3@main:
movsxd rax, DWORD PTR [rdx-4]
movsxd rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main
What you see here is standard loop unrolling by the compiler. MSVC unrolls by 4 iterations and splits the var variable across four registers: r10, rbx, r9, and r8. Then at the end of the loop, these 4 partial sums are added back together.
Here's where the 4 sums are recombined:
lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main
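To make the transformation concrete, here is roughly the scalar C++ that MSVC's output corresponds to. This is a sketch only: the names arr, n, and the 64-bit var accumulator are my assumptions about the original source, not taken from it.

#include <cstddef>

// What MSVC's 4-way unrolling is morally equivalent to in C++.
long long sum_unrolled4(const int* arr, std::size_t n) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // r10, rbx, r9, r8
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += arr[i + 0];  // each movsxd sign-extends a 32-bit load
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    long long var = ((s2 + s3) + s0) + s1;     // the lea/add recombination
    for (; i < n; ++i)                         // leftover elements
        var += arr[i];
    return var;
}

The point of the four independent accumulators is to break the dependency chain: the four adds per iteration don't have to wait on each other.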
Note that MSVC currently does not do automatic vectorization.
Now let's look at part of your ICC output:
000000013F0510A2 movq xmm2,mmword ptr arr[rcx]
000000013F0510A8 add r8,8
000000013F0510AC punpckldq xmm2,xmm2
000000013F0510B0 add rcx,20h
000000013F0510B4 movdqa xmm3,xmm2
000000013F0510B8 pand xmm2,xmm0
000000013F0510BC movq xmm4,mmword ptr [rdx+8]
000000013F0510C1 psrad xmm3,1Fh
000000013F0510C6 punpckldq xmm4,xmm4
000000013F0510CA pand xmm3,xmm1
000000013F0510CE por xmm3,xmm2
000000013F0510D2 movdqa xmm5,xmm4
000000013F0510D6 movq xmm2,mmword ptr [rdx+10h]
000000013F0510DB psrad xmm5,1Fh
000000013F0510E0 punpckldq xmm2,xmm2
000000013F0510E4 pand xmm5,xmm1
000000013F0510E8 paddq xmm6,xmm3
...
What you're seeing here is ICC's attempt to vectorize the loop. It does so in a manner similar to MSVC (splitting the sum into multiple partial sums), but using SSE registers, with two sums per register. As it turns out, though, the overhead of the vectorization outweighs its benefits.
If we walk through these instructions one by one, we can see how ICC tries to vectorize it:
// Load two ints using a 64-bit load: {x, y, 0, 0}
movq xmm2,mmword ptr arr[rcx]

// Shuffle the data into this form:
punpckldq xmm2,xmm2    // xmm2 = {x, x, y, y}
movdqa xmm3,xmm2       // xmm3 = {x, x, y, y}

// Mask out indices 1 and 3.
pand xmm2,xmm0         // xmm2 = {x, 0, y, 0}

// Arithmetic right-shift to copy the sign bit across each word.
psrad xmm3,1Fh         // xmm3 = {sign(x), sign(x), sign(y), sign(y)}

// Mask out indices 0 and 2.
pand xmm3,xmm1         // xmm3 = {0, sign(x), 0, sign(y)}

// Combine to get the sign-extended values.
por xmm3,xmm2          // xmm3 = {x, sign(x), y, sign(y)}
                       //      = {x, y} as two 64-bit integers

// Add into the accumulator...
paddq xmm6,xmm3
So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64 bits using only SSE2-level instructions.
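For reference, here is the same sign-extension dance written out in SSE2 intrinsics. A sketch, not ICC's code: the function name and mask constants are mine.

#include <emmintrin.h>  // SSE2

// Sign-extend two 32-bit ints to 64-bit and add them to two running
// 64-bit sums, using only SSE2 -- the same trick as the assembly above.
__m128i add_two_ints_sse2(__m128i acc, const int* p) {
    __m128i v = _mm_loadl_epi64((const __m128i*)p);       // {x, y, 0, 0}
    v = _mm_unpacklo_epi32(v, v);                         // {x, x, y, y}
    __m128i sign = _mm_srai_epi32(v, 31);                 // sign-fill each lane
    __m128i lo = _mm_and_si128(v, _mm_set_epi32(0, -1, 0, -1));     // {x, 0, y, 0}
    __m128i hi = _mm_and_si128(sign, _mm_set_epi32(-1, 0, -1, 0));  // {0, sign(x), 0, sign(y)}
    __m128i wide = _mm_or_si128(lo, hi);                  // {x, y} as 64-bit ints
    return _mm_add_epi64(acc, wide);                      // two parallel sums
}

(In ICC's output the two masks sit in xmm0 and xmm1, loaded once outside the loop; a compiler hoists the _mm_set_epi32 constants the same way.)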
SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.
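With SSE4.1 available, the whole shuffle/shift/mask/or sequence collapses into a single sign-extending conversion. Again a sketch, not actual ICC output:

#include <smmintrin.h>  // SSE4.1

__m128i add_two_ints_sse41(__m128i acc, const int* p) {
    __m128i v = _mm_loadl_epi64((const __m128i*)p);  // {x, y, 0, 0}
    __m128i wide = _mm_cvtepi32_epi64(v);            // PMOVSXDQ: sign-extend to {x, y}
    return _mm_add_epi64(acc, wide);
}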
But the point is:
The Intel compiler is trying to vectorize the loop, but the overhead it adds outweighs the benefit of vectorizing it in the first place. That's why it's slower.
EDIT: Update with the OP's results on:
- ICC x64 no vectorization
- ICC x86 with vectorization
You changed the data type to double, so now it's floating-point. There are no more of those ugly sign-fill shifts that were plaguing the integer version.
But since you disabled vectorization for the x64 version, it obviously becomes slower.
ICC x86 with vectorization:
00B8109E addpd xmm0,xmmword ptr arr[edx*8]
00B810A4 addpd xmm1,xmmword ptr [esp+edx*8+40h]
00B810AA addpd xmm0,xmmword ptr [esp+edx*8+50h]
00B810B0 addpd xmm1,xmmword ptr [esp+edx*8+60h]
00B810B6 add edx,8
00B810B9 cmp edx,400h
00B810BF jb wmain+9Eh (0B8109Eh)
Not much here - standard vectorization + 4x loop-unrolling.
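In intrinsics form, that loop is roughly the following sketch. The 0x400 (1024) element count comes from the cmp above; the function wrapper and alignment assumption are mine (the aligned addpd memory operands imply a 16-byte-aligned arr).

#include <emmintrin.h>  // SSE2

double sum_vectorized(const double* arr) {
    // Two vector accumulators (xmm0/xmm1), each holding two partial
    // sums, with the vector loop unrolled 4x: 8 doubles per iteration.
    __m128d a0 = _mm_setzero_pd();
    __m128d a1 = _mm_setzero_pd();
    for (int i = 0; i < 0x400; i += 8) {
        a0 = _mm_add_pd(a0, _mm_load_pd(arr + i + 0));
        a1 = _mm_add_pd(a1, _mm_load_pd(arr + i + 2));
        a0 = _mm_add_pd(a0, _mm_load_pd(arr + i + 4));
        a1 = _mm_add_pd(a1, _mm_load_pd(arr + i + 6));
    }
    // Horizontal reduction of the four partial sums.
    __m128d t = _mm_add_pd(a0, a1);
    return _mm_cvtsd_f64(_mm_add_sd(t, _mm_unpackhi_pd(t, t)));
}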
ICC x64 with no vectorization:
000000013FC010B2 lea ecx,[rdx+rdx]
000000013FC010B5 inc edx
000000013FC010B7 cmp edx,200h
000000013FC010BD addsd xmm6,mmword ptr arr[rcx*8]
000000013FC010C3 addsd xmm6,mmword ptr [rsp+rcx*8+58h]
000000013FC010C9 jb wmain+0B2h (13FC010B2h)
No vectorization + only 2x loop-unrolling.
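Which corresponds to something like this (same assumed names as before):

double sum_scalar(const double* arr) {
    double sum = 0.0;  // everything funnels through xmm6
    for (int i = 0; i < 0x400; i += 2) {
        sum += arr[i];      // addsd
        sum += arr[i + 1];  // addsd
    }
    return sum;
}

Since both adds feed the single accumulator, each addsd has to wait for the previous one; the vectorized x86 version keeps four independent partial sums in flight.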
All else being equal, disabling vectorization hurts performance in this floating-point case.