Do sorted and distinct immediately process the stream?

孤城傲影 2021-02-07 12:24

Imagine I have something that looks like this:

Stream<Integer> stream = Stream.of(2,1,3,5,6,7,9,11,10)
            .distinct()
            .sorted();
         


        
2 answers
  •  别那么骄傲
    2021-02-07 12:43

    You have asked a loaded question, implying that there had to be a choice between two alternatives.

    The stateful intermediate operations have to store data, in some cases up to the point of storing all elements before being able to pass an element downstream, but that doesn’t change the fact that this work is deferred until a terminal operation has been commenced.

    It’s also not correct to say that it has to “traverse the stream twice”. There are entirely different traversals going on: in the case of sorted(), the first is the traversal of the source, filling an internal buffer that will be sorted, and the second is the traversal of that buffer. In the case of distinct(), no second traversal happens in sequential processing; the internal HashSet is only used to decide whether to pass an element downstream.

    So when you run

    Stream<Integer> stream = Stream.of(2,1,3,5,3)
        .peek(i -> System.out.println("source: "+i))
        .distinct()
        .peek(i -> System.out.println("distinct: "+i))
        .sorted()
        .peek(i -> System.out.println("sorted: "+i));
    System.out.println("commencing terminal operation");
    stream.forEachOrdered(i -> System.out.println("terminal: "+i));
    

    it prints

    commencing terminal operation
    source: 2
    distinct: 2
    source: 1
    distinct: 1
    source: 3
    distinct: 3
    source: 5
    distinct: 5
    source: 3
    sorted: 1
    terminal: 1
    sorted: 2
    terminal: 2
    sorted: 3
    terminal: 3
    sorted: 5
    terminal: 5
    

    showing that nothing happens before the terminal operation has been commenced and that elements from the source immediately pass the distinct() operation (unless they are duplicates), whereas all elements are buffered in the sorted() operation before being passed downstream.
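To make the difference between the two stateful operations concrete, here is a minimal hand-written sketch of the two mechanisms described above. It is not the JDK's implementation: the names StatefulOpsSketch, distinctStage, SortedStage, finish and run are hypothetical and exist only for illustration. The distinct-like stage consults a HashSet and forwards each new element right away, while the sorted-like stage can only buffer until it is told that the source is exhausted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Illustrative sketch only; all names here are hypothetical, not JDK internals.
public class StatefulOpsSketch {

    // distinct-like stage: remembers what it has seen,
    // but forwards every new element immediately.
    static <T> Consumer<T> distinctStage(Consumer<T> downstream) {
        Set<T> seen = new HashSet<>();
        return t -> {
            if (seen.add(t)) { // add() returns false for a duplicate
                downstream.accept(t);
            }
        };
    }

    // sorted-like stage: while elements arrive it can only buffer them;
    // forwarding happens once the source is exhausted (finish()).
    static final class SortedStage<T extends Comparable<T>> implements Consumer<T> {
        private final List<T> buffer = new ArrayList<>();
        private final Consumer<T> downstream;

        SortedStage(Consumer<T> downstream) {
            this.downstream = downstream;
        }

        @Override
        public void accept(T t) {
            buffer.add(t);
        }

        void finish() {
            buffer.sort(null); // null comparator means natural ordering
            buffer.forEach(downstream); // second traversal: over the buffer
        }
    }

    // Pushes 2,1,3,5,3 through both stages and records what happens when.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        SortedStage<Integer> sorted =
                new SortedStage<Integer>(i -> log.add("sorted: " + i));
        Consumer<Integer> logDistinct = i -> log.add("distinct: " + i);
        Consumer<Integer> distinct = distinctStage(logDistinct.andThen(sorted));

        for (int i : new int[] {2, 1, 3, 5, 3}) {
            log.add("source: " + i);
            distinct.accept(i);
        }
        sorted.finish();
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Running main prints the same source/distinct/sorted sequence as the stream example above (without the terminal lines): the second 3 is consumed from the source but never reaches the sorted stage, and no "sorted" line appears until every source element has been seen.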

    It can further be shown that distinct() does not need to traverse the entire stream:

    Stream.of(2,1,1,3,5,6,7,9,2,1,3,5,11,10)
        .peek(i -> System.out.println("source: "+i))
        .distinct()
        .peek(i -> System.out.println("distinct: "+i))
        .filter(i -> i>2)
        .findFirst().ifPresent(i -> System.out.println("found: "+i));
    

    prints

    source: 2
    distinct: 2
    source: 1
    distinct: 1
    source: 1
    source: 3
    distinct: 3
    found: 3
    

    As explained and demonstrated by Jose Da Silva’s answer, the amount of buffering may change with ordered parallel streams, as partial results must be adjusted before they can get passed to downstream operations.

    Since these operations do not happen before the actual terminal operation is known, more optimizations are possible than OpenJDK currently performs (they may appear in other implementations or in future versions). E.g. sorted().toArray() may use and return the same array, or sorted().findFirst() may turn into a min(), etc.
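As a small illustration of the last point, the sketch below shows that sorted().findFirst() and min() with the natural-order comparator yield the same result, which is why a stream implementation would be free to substitute one for the other. The class and method names (SortedFindFirstVsMin, viaSorted, viaMin) are made up for this example, and nothing here claims that any particular JDK actually performs this rewrite.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; class and method names are hypothetical.
public class SortedFindFirstVsMin {

    // Buffers and sorts all elements, then takes the first one.
    static Optional<Integer> viaSorted(List<Integer> values) {
        return values.stream().sorted().findFirst();
    }

    // Finds the smallest element in a single pass, without buffering.
    static Optional<Integer> viaMin(List<Integer> values) {
        return values.stream().min(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(2, 1, 3, 5, 6, 7, 9, 11, 10);
        System.out.println(viaSorted(values)); // Optional[1]
        System.out.println(viaMin(values));    // Optional[1]
    }
}
```

If only the smallest element is needed, calling min() directly avoids relying on such an optimization.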
