Talking about PC hardware, early hardware prefetchers (circa 2005, say) were better at detecting and prefetching forward access streams, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is entirely possible that it still implements only basic forward-only prefetching.
Outside of proper prefetching implemented in the cache hardware, which actually detects access patterns, it is very common for hardware to fetch more than one cache line when a cache miss occurs. Often this takes the form of simply fetching the next cache line, in addition to the required one, on a miss. This implementation would give the forward direction a big advantage by effectively halving its cache miss rate (assuming the pattern-detecting prefetcher is ineffective).
Locally, on a Core i7, I get slightly better results for the linked list version: about 3.3 ms for the whole iteration, vs 3.5 ms for the array version, when using the original program (which iterates the linked list in reverse order of creation). So I don't see the same effect you did.
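To make the "reverse order of creation" point concrete, here is a hypothetical sketch (the Node fields are my assumption, not your code) of a list built by prepending to the head. Traversal order is then the reverse of allocation order, so with a bump-pointer allocator (as in a typical JVM TLAB) the walk tends to visit descending addresses, which is exactly the backward pattern a forward-only prefetcher misses:

```java
// Sketch: prepending nodes means traversal visits them in reverse
// allocation order. Node and its fields are illustrative assumptions.
class ReverseOrderList {
    static final class Node {
        final int value1;
        final Node next;
        Node(int value1, Node next) { this.value1 = value1; this.next = next; }
    }

    static Node build(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) {
            // The node allocated last (i = n-1) becomes the head,
            // so it is visited first during traversal.
            head = new Node(i, head);
        }
        return head;
    }
}
```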
The inner loop of your test, which checks the value of val, has a big impact. The current loop will cause a lot of mispredicts, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test it was, since I got about 1 ns per iteration for small iteration counts that fit in L1; 1 ns (about 3 cycles) isn't consistent with a full branch mispredict. When I changed it to do an unconditional val += msg.value1, the array version got a significant boost, even in the 1,000,000-element case (which probably won't even fit in L3).
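For illustration, the two inner-loop shapes being compared look roughly like this (Msg and the > 0 condition are my reconstruction, not necessarily your exact test):

```java
// Hypothetical reconstruction of the two inner-loop variants discussed.
class LoopVariants {
    static final class Msg {
        final int value1;
        Msg(int value1) { this.value1 = value1; }
    }

    // Branchy variant: the data-dependent branch can mispredict,
    // unless the JIT compiles it down to a conditional move (CMOV).
    static long sumConditional(Msg[] msgs) {
        long val = 0;
        for (int i = 0; i < msgs.length; i++) {
            if (msgs[i].value1 > 0) {
                val += msgs[i].value1;
            }
        }
        return val;
    }

    // Unconditional variant: the "val += msg.value1" transformation.
    // No branch on the loaded data, so no data-dependent mispredicts.
    static long sumUnconditional(Msg[] msgs) {
        long val = 0;
        for (int i = 0; i < msgs.length; i++) {
            val += msgs[i].value1;
        }
        return val;
    }
}
```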
Interestingly enough, the same transformation (val += msg.value1) made the linked list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (inside L2), and the two approaches were comparable outside it. From caliper:
  length method           ns linear runtime
     100  ARRAY         63.7 =
     100 LINKED        190.1 =
    1000  ARRAY        725.7 =
    1000 LINKED       1788.5 =
 1000000  ARRAY    2904083.2 ===
 1000000 LINKED    3043820.4 ===
10000000  ARRAY   23160128.5 ==========================
10000000 LINKED   25748352.0 ==============================
The behavior for small iteration counts is easier to explain: the linked list, which has to use pointer chasing, has a data dependency between each iteration of the loop. That is, each iteration depends on the previous one, because the address to load comes from the previous element. The array doesn't have this data dependency: only the increment of i carries across iterations, and that is very fast (i is certainly in a register here). So the loop can be pipelined much better in the array case.
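The dependency difference can be seen directly in the two traversal shapes (Node and the field names are illustrative assumptions):

```java
// Sketch of the two traversals and their dependency structure.
class Traversals {
    static final class Node {
        int value1;
        Node next;
    }

    // Linked list: the load of n.next must complete before the next
    // iteration's address is even known, forming a serial chain of
    // loads that the CPU cannot overlap.
    static long sumList(Node head) {
        long val = 0;
        for (Node n = head; n != null; n = n.next) {
            val += n.value1;
        }
        return val;
    }

    // Array: the address of element i+1 is just base + (i+1) * stride,
    // independent of any loaded data, so successive loads can be
    // issued and overlapped in the pipeline.
    static long sumArray(int[] values) {
        long val = 0;
        for (int i = 0; i < values.length; i++) {
            val += values[i];
        }
        return val;
    }
}
```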