Question
From this question, "Will inner parallel streams be processed fully in parallel before considering parallelizing outer stream?", I understood that streams perform work-stealing. However, I've noticed that it often doesn't seem to occur. For example, if I have a List of, say, 100,000 elements and I attempt to process it in parallelStream() fashion, I often notice towards the end that most of my CPU cores are sitting idle in the "waiting" state. (Note: of the 100,000 elements in the list, some take a long time to process, whereas others are fast; and the list is not balanced, which is why some threads may get "unlucky" and have lots to do, whereas others get lucky and have little to do.)
So, my theory is that the JIT compiler does an initial division of the 100,000 elements into the 16 threads (because I have 16 cores), but then within each thread it just runs a simple (sequential) for-loop (as that would be the most efficient), and therefore no work stealing would ever occur (which is what I'm seeing).
I think the reason why "Will inner parallel streams be processed fully in parallel before considering parallelizing outer stream?" showed work stealing is that there was an outer loop that was streaming and an inner loop that was streaming, so in that case each inner loop got evaluated at run time and would create new tasks that could, at runtime, be assigned to "idle" threads (see the sketch below). Thoughts? Is there something I'm doing wrong that would "force" a simple list.parallelStream() to use work-stealing? (My current workaround is to attempt to balance the list based on various heuristics so that each thread sees roughly the same amount of work; but it's hard to predict that....)
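To make that nested-stream theory concrete, here is a minimal, hypothetical sketch of the outer/inner pattern (the class name, the uneven group sizes, and the simulateWork method are all invented for illustration): each inner parallelStream() is split again at run time, producing additional subtasks that idle workers can pick up, unlike a single flat list.parallelStream() whose chunks are fixed up front.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class NestedParallelDemo {
    public static void main(String[] args) {
        // Hypothetical data: 100 groups of deliberately uneven size.
        List<List<Integer>> groups = IntStream.range(0, 100)
                .mapToObj(g -> IntStream.range(0, (g % 10 + 1) * 100)
                        .boxed()
                        .collect(Collectors.toList()))
                .collect(Collectors.toList());

        // Outer parallel stream: each group becomes its own chunk of work.
        // Inner parallel stream: each group is split again at run time,
        // creating additional subtasks that idle workers can steal.
        long total = groups.parallelStream()
                .mapToLong(group -> group.parallelStream()
                        .mapToLong(NestedParallelDemo::simulateWork)
                        .sum())
                .sum();
        System.out.println(total);
    }

    // Stand-in for a per-element operation whose cost varies with the input.
    static long simulateWork(int x) {
        long acc = 0;
        for (int i = 0; i < x; i++) {
            acc += i % 7;
        }
        return acc;
    }
}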
Answer 1:
This has nothing to do with the JIT compiler but with the implementation of the Stream API. It will divide the workload into chunks which are then processed sequentially by the worker threads. The general strategy is to have more jobs than worker threads to enable work-stealing; see, for example, ForkJoinTask.getSurplusQueuedTaskCount(), which can be used to implement such an adaptive strategy.
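To illustrate that adaptive idea (this is not the Stream API's internal code, just a sketch modeled on the pattern shown in the RecursiveAction/ForkJoinTask documentation), a hand-written fork/join task can consult getSurplusQueuedTaskCount() and keep splitting only while its local queue is short; the class name, the threshold of 3, and the Math.sqrt placeholder work are assumptions:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.function.IntConsumer;

// Adaptive divide-and-conquer: keep splitting off the right half as long as
// this worker holds only a few surplus queued tasks, i.e. as long as other
// workers might still have nothing to steal.
class AdaptiveRangeTask extends RecursiveAction {
    private final int from;          // inclusive
    private final int to;            // exclusive
    private final IntConsumer work;  // per-element action (hypothetical)

    AdaptiveRangeTask(int from, int to, IntConsumer work) {
        this.from = from;
        this.to = to;
        this.work = work;
    }

    @Override
    protected void compute() {
        int lo = from, hi = to;
        Deque<AdaptiveRangeTask> forked = new ArrayDeque<>();
        // Split while the range is divisible and the local queue is short.
        while (hi - lo > 1 && getSurplusQueuedTaskCount() <= 3) {
            int mid = (lo + hi) >>> 1;
            AdaptiveRangeTask right = new AdaptiveRangeTask(mid, hi, work);
            right.fork();            // make the right half available for stealing
            forked.push(right);
            hi = mid;                // keep processing the left half ourselves
        }
        for (int i = lo; i < hi; i++) {
            work.accept(i);          // leaf work, processed sequentially
        }
        for (AdaptiveRangeTask task : forked) {
            task.join();             // wait for the (possibly stolen) right halves
        }
    }

    public static void main(String[] args) {
        // Drive it on the common pool; Math.sqrt stands in for real work.
        ForkJoinPool.commonPool().invoke(
                new AdaptiveRangeTask(0, 10_000, i -> Math.sqrt(i)));
    }
}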
The following code can be used to detect how many elements were processed sequentially when the source is an ArrayList:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

List<Object> list = new ArrayList<>(Collections.nCopies(10_000, ""));
System.out.println(System.getProperty("java.version"));
System.out.println(Runtime.getRuntime().availableProcessors());
// Each leaf chunk gets its own container holding a single counter; the
// accumulator increments it once per element, and the combiner concatenates
// the counters, so the result has one entry per sequentially processed chunk.
System.out.println(list.parallelStream()
    .collect(
        () -> new ArrayList<>(Collections.singleton(0)),
        (l, x) -> l.replaceAll(i -> i + 1),
        List::addAll));
On my current test machine, it prints:
1.8.0_60
4
[625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625, 625]
So there are more chunks than cores, to allow work-stealing. However, once the sequential processing of a chunk has started, it can’t be split further, so this implementation has limitations when the per-element execution times differ significantly. This is always a trade-off.
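One possible workaround when per-element costs are highly uneven (an alternative approach, not something this answer prescribes) is to give up on chunked traversal and submit one task per element, for example via CompletableFuture.supplyAsync on the common pool, so that idle threads always find queued work; the heavyOperation method and the surrounding class below are hypothetical:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class PerElementTasks {
    // Stand-in for an expensive operation whose cost varies per element.
    static String heavyOperation(String input) {
        return input.toUpperCase();
    }

    static List<String> processAll(List<String> input) {
        // One task per element: the queue drains as workers become free,
        // so a few slow elements cannot strand a whole pre-assigned chunk.
        List<CompletableFuture<String>> futures = input.stream()
                .map(s -> CompletableFuture.supplyAsync(() -> heavyOperation(s)))
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}

This trades the low overhead of chunked traversal for per-element scheduling overhead, so it only pays off when individual elements are expensive enough to dominate that cost.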
Source: https://stackoverflow.com/questions/50283041/a-simple-list-parallelstream-in-java-8-stream-does-not-seem-to-do-work-stealin