Why filter() after flatMap() is “not completely” lazy in Java streams?

Asked by 误落风尘, 2020-11-22 08:17

I have the following sample code:

System.out.println(
        "Result: " +
                Stream.of(1, 2, 3)
                        .filter(i -> {
                            System.out.println(i);
                            return true;
                        })
                        .findFirst()
                        .get());

System.out.println("-----------");

System.out.println(
        "Result: " +
                Stream.of(1, 2, 3)
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .flatMap(i -> Stream.of(i - 1, i, i + 1))
                        .filter(i -> {
                            System.out.println(i);
                            return true;
                        })
                        .findFirst()
                        .get());

The first pipeline prints only 1 before Result: 1, as I would expect from lazy evaluation. The second one, however, prints all nine flattened elements (-1 0 1 0 1 2 1 2 3) before printing Result: -1, even though findFirst() needs only the first element. Why is the filter() after the flatMap() calls not evaluated lazily?
7 Answers
  • 2020-11-22 08:42

    TL;DR, this has been addressed in JDK-8075939 and fixed in Java 10 (and backported to Java 8 in JDK-8225328).

    Looking into the implementation (ReferencePipeline.java), we see the method [link]

    @Override
    final void forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
        do { } while (!sink.cancellationRequested() && spliterator.tryAdvance(sink));
    }
    

    which will be invoked for the findFirst operation. The special thing to take care of is the sink.cancellationRequested() call, which allows the loop to end on the first match. Compare it to [link]

    @Override
    public final <R> Stream<R> flatMap(Function<? super P_OUT, ? extends Stream<? extends R>> mapper) {
        Objects.requireNonNull(mapper);
        // We can do better than this, by polling cancellationRequested when stream is infinite
        return new StatelessOp<P_OUT, R>(this, StreamShape.REFERENCE,
                                     StreamOpFlag.NOT_SORTED | StreamOpFlag.NOT_DISTINCT | StreamOpFlag.NOT_SIZED) {
            @Override
            Sink<P_OUT> opWrapSink(int flags, Sink<R> sink) {
                return new Sink.ChainedReference<P_OUT, R>(sink) {
                    @Override
                    public void begin(long size) {
                        downstream.begin(-1);
                    }
    
                    @Override
                    public void accept(P_OUT u) {
                        try (Stream<? extends R> result = mapper.apply(u)) {
                            // We can do better that this too; optimize for depth=0 case and just grab spliterator and forEach it
                            if (result != null)
                                result.sequential().forEach(downstream);
                        }
                    }
                };
            }
        };
    }
    

    The method for advancing one item ends up calling forEach on the sub-stream without any possibility of earlier termination, and the comment at the beginning of the flatMap method even hints at this missing feature.

    Since this is more than just an optimization matter, as it implies that the code simply breaks when the sub-stream is infinite, I hope that the developers soon prove that they "can do better than this"…


    To illustrate the implications, while Stream.iterate(0, i->i+1).findFirst() works as expected, Stream.of("").flatMap(x->Stream.iterate(0, i->i+1)).findFirst() will end up in an infinite loop.
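
    To reproduce this yourself, here is a minimal, self-contained demo (run it on a JDK predating the fix; on Java 10+ both calls terminate):

    import java.util.stream.Stream;

    public class FlatMapLaziness {
        public static void main(String[] args) {
            // Terminates immediately: findFirst() cancels the iteration
            System.out.println(
                    Stream.iterate(0, i -> i + 1).findFirst().get()); // prints 0

            // Hangs on affected JDKs: flatMap drains the entire (infinite)
            // sub-stream before findFirst() can request cancellation
            System.out.println(
                    Stream.of("")
                          .flatMap(x -> Stream.iterate(0, i -> i + 1))
                          .findFirst()
                          .get()); // never returns before the fix
        }
    }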

    Regarding the specification, most of it can be found in the chapter “Stream operations and pipelines” of the package specification:

    Intermediate operations return a new stream. They are always lazy;

    … Laziness also allows avoiding examining all the data when it is not necessary; for operations such as "find the first string longer than 1000 characters", it is only necessary to examine just enough strings to find one that has the desired characteristics without examining all of the strings available from the source. (This behavior becomes even more important when the input stream is infinite and not merely large.)

    Further, some operations are deemed short-circuiting operations. An intermediate operation is short-circuiting if, when presented with infinite input, it may produce a finite stream as a result. A terminal operation is short-circuiting if, when presented with infinite input, it may terminate in finite time. Having a short-circuiting operation in the pipeline is a necessary, but not sufficient, condition for the processing of an infinite stream to terminate normally in finite time.

    It’s clear that a short-circuiting operation doesn’t guarantee termination in finite time, e.g. when a filter matches no item the processing can’t complete, but an implementation which doesn’t support any termination in finite time, simply ignoring the short-circuiting nature of an operation, is far off the specification.

  • 2020-11-22 08:46

    I agree with the other answers that this is a bug, tracked as JDK-8075939. Since it still wasn't fixed more than one year later, I would like to recommend AbacusUtil:

    N.println("Result: " + Stream.of(1, 2, 3).peek(N::println).first().get());
    
    N.println("-----------");
    
    N.println("Result: " + Stream.of(1, 2, 3)
                            .flatMap(i -> Stream.of(i - 1, i, i + 1))
                            .flatMap(i -> Stream.of(i - 1, i, i + 1))
                            .peek(N::println).first().get());
    
    // output:
    // 1
    // Result: 1
    // -----------
    // -1
    // Result: -1
    

    Disclosure: I'm the developer of AbacusUtil.

  • 2020-11-22 08:46

    Today I also stumbled upon this bug. The behavior is not so straightforward: a simple case like the one below works fine, but similar production code doesn't.

     stream(spliterator).map(o -> o).flatMap(Stream::of).flatMap(Stream::of).findAny()
    

    For those who cannot wait another couple of years for the migration to JDK 10, there is an alternative, truly lazy stream implementation. It doesn't support parallel processing; it was dedicated to JavaScript translation, but it worked out for me because the interface is the same.

    StreamHelper is collection-based, but it is easy to adapt a Spliterator (see the sketch after the link).

    https://github.com/yaitskov/j4ts/blob/stream/src/main/java/javaemul/internal/stream/StreamHelper.java
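
    For instance, a Spliterator can be bridged back to an Iterable in one line (assuming the helper can consume an Iterable; its actual entry point may differ):

    import java.util.Spliterator;
    import java.util.Spliterators;

    // Bridge a Spliterator to an Iterable that a collection-based helper
    // can consume. Note: the returned Iterable may be traversed only once.
    static <T> Iterable<T> toIterable(Spliterator<T> spliterator) {
        return () -> Spliterators.iterator(spliterator);
    }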

  • 2020-11-22 08:48

    Unfortunately, .flatMap() is not lazy. However, a custom flatMap workaround is available here: Why .flatMap() is so inefficient (non lazy) in java 8 and java 9, and a sketch of the same idea follows below.
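
    Here is a minimal sketch of such a workaround for sequential streams (my own illustration of the idea, not the linked answer's code): inner elements are pulled one at a time via tryAdvance, so a short-circuiting terminal operation stops consuming as soon as it is satisfied.

    import java.util.Spliterator;
    import java.util.Spliterators;
    import java.util.function.Consumer;
    import java.util.function.Function;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    public class LazyFlatMap {
        // Lazy flatMap for sequential streams: each inner element is pulled
        // individually, so cancellation is honored between elements.
        // (Unlike the real flatMap, this sketch does not close inner streams.)
        static <T, R> Stream<R> flatMapLazy(Stream<T> source,
                Function<? super T, ? extends Stream<? extends R>> mapper) {
            Spliterator<T> outer = source.spliterator();
            return StreamSupport.stream(
                    new Spliterators.AbstractSpliterator<R>(Long.MAX_VALUE, 0) {
                Spliterator<? extends R> inner;

                @Override
                public boolean tryAdvance(Consumer<? super R> action) {
                    while (true) {
                        if (inner != null && inner.tryAdvance(action))
                            return true;  // delivered one inner element
                        // inner exhausted (or absent): pull the next outer element
                        if (!outer.tryAdvance(
                                t -> inner = mapper.apply(t).spliterator()))
                            return false; // outer exhausted as well
                    }
                }
            }, false);
        }

        public static void main(String[] args) {
            // Terminates even though the inner stream is infinite:
            System.out.println(
                    flatMapLazy(Stream.of(""), x -> Stream.iterate(0, i -> i + 1))
                            .findFirst()
                            .get()); // prints 0
        }
    }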

  • 2020-11-22 08:50

    In my free StreamEx library I introduced short-circuiting collectors. When collecting a sequential stream with a short-circuiting collector (like MoreCollectors.first()), exactly one element is consumed from the source. Internally it's implemented in a quite dirty way: a custom exception is used to break the control flow. Using my library, your sample could be rewritten in this way:

    System.out.println(
            "Result: " +
                    StreamEx.of(1, 2, 3)
                    .flatMap(i -> Stream.of(i - 1, i, i + 1))
                    .flatMap(i -> Stream.of(i - 1, i, i + 1))
                    .filter(i -> {
                        System.out.println(i);
                        return true;
                    })
                    .collect(MoreCollectors.first())
                    .get()
            );
    

    The result is the following:

    -1
    Result: -1
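
    The control-flow-exception idea can be sketched with plain streams as well (a toy illustration of the trick, not StreamEx's actual implementation): an exception thrown from the consumer propagates out through flatMap's inner forEach, so only one source element is ever consumed.

    import java.util.stream.Stream;

    public class ExceptionShortCircuit {
        // Control-flow exception carrying the first element seen
        static class Found extends RuntimeException {
            final int value;
            Found(int value) { this.value = value; }
        }

        public static void main(String[] args) {
            try {
                Stream.of(1, 2, 3)
                      .flatMap(i -> Stream.of(i - 1, i, i + 1))
                      .forEach(i -> { throw new Found(i); }); // abort on first element
            } catch (Found e) {
                System.out.println("Result: " + e.value);     // prints "Result: 0"
            }
        }
    }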
    
  • 2020-11-22 08:52

    The elements of the input stream are consumed lazily, one by one. The first element, 1, is transformed by the two flatMaps into the stream -1, 0, 1, 0, 1, 2, 1, 2, 3, so this entire stream corresponds to just the first input element. The nested streams are eagerly materialized by the pipeline, then flattened, and then fed to the filter stage. This explains your output, as the snippet below confirms.
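
    A quick way to see this expansion for yourself:

    import java.util.stream.Stream;

    // Expanding only the first input element through both flatMaps:
    Stream.of(1)
          .flatMap(i -> Stream.of(i - 1, i, i + 1))
          .flatMap(i -> Stream.of(i - 1, i, i + 1))
          .forEach(System.out::println); // prints -1 0 1 0 1 2 1 2 3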

    This behavior does not stem from a fundamental limitation, but getting full-blown laziness for nested streams would probably make things much more complicated. I suspect it would be an even greater challenge to make it performant.

    For comparison, Clojure's lazy seqs get another layer of wrapping for each such level of nesting. Due to this design, the operations may even fail with StackOverflowError when nesting is exercised to the extreme.
