Talking about PC hardware, early hardware prefetchers (circa 2005, say) were better at detecting and prefetching forward access streams, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is entirely possible that it still implements only basic forward-only prefetching.
Outside of proper prefetching implemented in the cache hardware, which actually detects access patterns, it is very common for hardware to fetch more than one cache line when a cache miss occurs. Often this takes the form of simply fetching the next cache line, in addition to the required one, on a miss. This implementation would give the forward direction a big advantage by effectively halving its cache miss rate (assuming the pattern-detecting prefetcher is ineffective).
Locally, on a Core i7, I get slightly better results for the linked list version: about 3.3 ms for the whole iteration, vs 3.5 ms for the array version, when using the original program (which iterates the linked list in reverse order of creation). So I don't see the same effect you did.
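To make the "reverse order of creation" point concrete, here is a hypothetical sketch (the Node fields are my assumption, not your code) of a list built by prepending to the head. Traversal order is then the reverse of allocation order, so with a bump-pointer allocator (as in a typical JVM TLAB) the walk tends to visit descending addresses, which is exactly the backward pattern a forward-only prefetcher misses:

```java
// Sketch: prepending nodes means traversal visits them in reverse
// allocation order. Node and its fields are illustrative assumptions.
class ReverseOrderList {
    static final class Node {
        final int value1;
        final Node next;
        Node(int value1, Node next) { this.value1 = value1; this.next = next; }
    }

    static Node build(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) {
            // The node allocated last (i = n-1) becomes the head,
            // so it is visited first during traversal.
            head = new Node(i, head);
        }
        return head;
    }
}
```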
The inner loop of your test, which checks the value of val, has a big impact. The current loop will cause a lot of mispredicts, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test it was, since I got about 1 ns per iteration for small iteration counts that fit in L1; 1 ns (about 3 cycles) isn't consistent with a full branch mispredict. When I changed it to do an unconditional val += msg.value1, the array version got a significant boost, even in the 1,000,000-element case (which probably won't even fit in L3).
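For illustration, the two inner-loop shapes being compared look roughly like this (Msg and the > 0 condition are my reconstruction, not necessarily your exact test):

```java
// Hypothetical reconstruction of the two inner-loop variants discussed.
class LoopVariants {
    static final class Msg {
        final int value1;
        Msg(int value1) { this.value1 = value1; }
    }

    // Branchy variant: the data-dependent branch can mispredict,
    // unless the JIT compiles it down to a conditional move (CMOV).
    static long sumConditional(Msg[] msgs) {
        long val = 0;
        for (int i = 0; i < msgs.length; i++) {
            if (msgs[i].value1 > 0) {
                val += msgs[i].value1;
            }
        }
        return val;
    }

    // Unconditional variant: the "val += msg.value1" transformation.
    // No branch on the loaded data, so no data-dependent mispredicts.
    static long sumUnconditional(Msg[] msgs) {
        long val = 0;
        for (int i = 0; i < msgs.length; i++) {
            val += msgs[i].value1;
        }
        return val;
    }
}
```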
Interestingly enough, the same transformation (val += msg.value1) made the linked list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (inside L2), and the two approaches were comparable outside it. From caliper:
  length method           ns linear runtime
     100  ARRAY         63.7 =
     100 LINKED        190.1 =
    1000  ARRAY        725.7 =
    1000 LINKED       1788.5 =
 1000000  ARRAY    2904083.2 ===
 1000000 LINKED    3043820.4 ===
10000000  ARRAY   23160128.5 ==========================
10000000 LINKED   25748352.0 ==============================
The behavior for small iteration counts is easier to explain: the linked list, which has to use pointer chasing, has a data dependency between each iteration of the loop. That is, each iteration depends on the previous one, because the address to load comes from the previous element. The array doesn't have this data dependency: only the increment of i carries across iterations, and that is very fast (i is certainly in a register here). So the loop can be pipelined much better in the array case.
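The dependency difference can be seen directly in the two traversal shapes (Node and the field names are illustrative assumptions):

```java
// Sketch of the two traversals and their dependency structure.
class Traversals {
    static final class Node {
        int value1;
        Node next;
    }

    // Linked list: the load of n.next must complete before the next
    // iteration's address is even known, forming a serial chain of
    // loads that the CPU cannot overlap.
    static long sumList(Node head) {
        long val = 0;
        for (Node n = head; n != null; n = n.next) {
            val += n.value1;
        }
        return val;
    }

    // Array: the address of element i+1 is just base + (i+1) * stride,
    // independent of any loaded data, so successive loads can be
    // issued and overlapped in the pipeline.
    static long sumArray(int[] values) {
        long val = 0;
        for (int i = 0; i < values.length; i++) {
            val += values[i];
        }
        return val;
    }
}
```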