I've already read this and this question, but I still doubt whether the observed behavior of Stream.skip was intended by the JDK authors.
Let's have simple
Recall that the goal of stream flags (ORDERED, SORTED, SIZED, DISTINCT) is to enable operations to avoid doing unnecessary work. Examples of optimizations that involve stream flags are:
if the stream is already known to be sorted, sorted() is a no-op;
if the size of the stream is known, toArray() can pre-allocate a correctly sized array, avoiding a copy.
Each stage of a pipeline has a set of stream flags. Intermediate operations can inject, preserve, or clear stream flags. For example, filtering preserves sorted-ness / distinct-ness but not sized-ness; mapping preserves sized-ness but not sorted-ness or distinct-ness. Sorting injects sorted-ness. The treatment of flags for intermediate operations is fairly straightforward, because all decisions are local.
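The flag rules above can be observed indirectly through the characteristics reported by a stream's spliterator. This is only a rough sketch: spliterator characteristics are the public counterpart of the internal stream flags, not the flags themselves, and the class name, the helper has(), and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.Spliterator;
import java.util.TreeSet;
import java.util.stream.Stream;

public class StreamFlagsSketch {
    // Reports whether the stream's spliterator advertises the given characteristic.
    // spliterator() is a terminal operation, so every stream is used exactly once.
    static boolean has(Stream<?> stream, int characteristic) {
        return stream.spliterator().hasCharacteristics(characteristic);
    }

    // Hypothetical sample source: a naturally ordered TreeSet is sorted, distinct and sized.
    static TreeSet<Integer> source() {
        return new TreeSet<>(Arrays.asList(3, 1, 2));
    }

    public static void main(String[] args) {
        // Filtering: sorted-ness and distinct-ness survive, sized-ness does not.
        System.out.println("filter SORTED:   " + has(source().stream().filter(x -> x > 1), Spliterator.SORTED));
        System.out.println("filter DISTINCT: " + has(source().stream().filter(x -> x > 1), Spliterator.DISTINCT));
        System.out.println("filter SIZED:    " + has(source().stream().filter(x -> x > 1), Spliterator.SIZED));

        // Mapping: sized-ness survives, sorted-ness and distinct-ness do not.
        System.out.println("map SIZED:       " + has(source().stream().map(x -> x * 2), Spliterator.SIZED));
        System.out.println("map SORTED:      " + has(source().stream().map(x -> x * 2), Spliterator.SORTED));
        System.out.println("map DISTINCT:    " + has(source().stream().map(x -> x * 2), Spliterator.DISTINCT));

        // Sorting an unsorted list source injects sorted-ness.
        System.out.println("sorted SORTED:   " + has(Arrays.asList(3, 1, 2).stream().sorted(), Spliterator.SORTED));
    }
}
```

The exact characteristics reported are an implementation detail of the JDK in use, so treat the printed values as an observation rather than a contract.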
The treatment of flags for terminal operations is more subtle. ORDERED is the most relevant flag for terminal ops. And if a terminal op is UNORDERED, then we do back-propagate the unordered-ness.
Why do we do this? Well, consider this pipeline:
set.stream()
.sorted()
.forEach(System.out::println);
Since forEach is not constrained to operate in order, the work of sorting the list is completely wasted effort. So we back-propagate this information (until we hit a short-circuiting operation, such as limit), so as not to lose this optimization opportunity. Similarly, we can use an optimized implementation of distinct on unordered streams.
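To make the difference between an unordered and an ordered terminal operation concrete, here is a small sketch contrasting forEach with forEachOrdered on a parallel stream. The class and method names are invented for this example. Only the forEachOrdered variant is guaranteed to see the elements in sorted order; what forEach does with the sort is up to the implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ForEachOrderSketch {
    // forEachOrdered respects encounter order, so the sort is observable in the result.
    static List<Integer> viaForEachOrdered(List<Integer> input) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        input.parallelStream().sorted().forEachOrdered(out::add);
        return out;
    }

    // forEach makes no ordering promise; the result holds the same elements
    // but possibly in a different order on each run.
    static List<Integer> viaForEach(List<Integer> input) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        input.parallelStream().sorted().forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 3, 1, 4, 2); // hypothetical sample data
        System.out.println("forEachOrdered: " + viaForEachOrdered(input));
        System.out.println("forEach:        " + viaForEach(input));
    }
}
```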
Is this behavior intended, or is it a bug?
Yes :) The back-propagation is intended, as it is a useful optimization that should not produce incorrect results. However, the bug part is that we are propagating past a previous skip, which we shouldn't. So the back-propagation of the UNORDERED flag is overly aggressive, and that's a bug. We'll post a bug.
If yes, is it documented somewhere?
It should be just an implementation detail; if it were correctly implemented, you wouldn't notice (except that your streams are faster.)
@Ruben, you probably don't understand my question. Roughly, the problem is: why does unordered().collect(toCollection(HashSet::new)) behave differently from collect(toSet())? Of course I know that toSet() is unordered.
Probably, but, anyway, I will give it a second try.
Having a look at the Javadocs of Collectors.toSet and Collectors.toCollection, we can see that toSet delivers an unordered collector:
This is an {@link Collector.Characteristics#UNORDERED unordered} Collector.
i.e., a CollectorImpl with the UNORDERED Characteristic. Having a look at the Javadoc of Collector.Characteristics#UNORDERED we can read:
Indicates that the collection operation does not commit to preserving the encounter order of input elements
In the Javadocs of Collector we can also see:
For concurrent collectors, an implementation is free to (but not required to) implement reduction concurrently. A concurrent reduction is one where the accumulator function is called concurrently from multiple threads, using the same concurrently-modifiable result container, rather than keeping the result isolated during accumulation. A concurrent reduction should only be applied if the collector has the {@link Characteristics#UNORDERED} characteristics or if the originating data is unordered
This means to me that, if we set the UNORDERED characteristic, we do not care at all about the order in which the elements of the stream get passed to the accumulator, and, therefore, the elements can be extracted from the pipeline in any order.
Btw, you get the same behavior if you omit the unordered() in your example:
System.out.println("skip-toSet: "
+ input.parallelStream().filter(x -> x > 0)
.skip(1)
.collect(Collectors.toSet()));
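A self-contained version of this comparison might look like the sketch below. The input list is a made-up stand-in, since the original one is not shown here, and the class and method names are invented for the example. With the ordered toCollection collector, skip(1) has to drop the first element in encounter order; with toSet, the only thing this sketch relies on is that exactly one element is missing.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SkipCollectSketch {
    // Hypothetical input; any list of distinct positive integers will do.
    static final List<Integer> INPUT = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

    // Unordered collector: which element skip(1) drops may depend on the JDK version.
    static Set<Integer> skipToSet() {
        return INPUT.parallelStream().filter(x -> x > 0)
                .skip(1)
                .collect(Collectors.toSet());
    }

    // Ordered collector: skip(1) must drop the first element in encounter order.
    static Set<Integer> skipToHashSet() {
        return INPUT.parallelStream().filter(x -> x > 0)
                .skip(1)
                .collect(Collectors.toCollection(HashSet::new));
    }

    public static void main(String[] args) {
        System.out.println("skip-toSet:        " + skipToSet());
        System.out.println("skip-toCollection: " + skipToHashSet());
    }
}
```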
Furthermore, the Javadoc of the skip() method in Stream gives us a hint:
While {@code skip()} is generally a cheap operation on sequential stream pipelines, it can be quite expensive on ordered parallel pipelines
and
Using an unordered stream source (such as {@link #generate(Supplier)}) or removing the ordering constraint with {@link #unordered()} may result in significant speedups
When using Collectors.toCollection(HashSet::new), you are creating a normal "ordered" Collector (one without the UNORDERED characteristic), which to me means that you do care about the ordering, and, therefore, the elements are extracted in order and you get the expected behavior.
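The difference between the two collectors can be checked directly through Collector.characteristics(). The following is a minimal sketch; the class name and the helper isUnordered() are invented for this example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class CollectorCharacteristicsSketch {
    // True if the collector declares that it does not preserve encounter order.
    static boolean isUnordered(Collector<?, ?, ?> collector) {
        return collector.characteristics().contains(Collector.Characteristics.UNORDERED);
    }

    public static void main(String[] args) {
        Collector<Integer, ?, Set<Integer>> toSet = Collectors.toSet();
        Collector<Integer, ?, Set<Integer>> toHashSet =
                Collectors.<Integer, Set<Integer>>toCollection(HashSet::new);

        System.out.println("toSet unordered:        " + isUnordered(toSet));
        System.out.println("toCollection unordered: " + isUnordered(toHashSet));
    }
}
```

Both collectors end up filling a HashSet here; only the declared characteristics differ, and that declaration is what the stream implementation reacts to.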