Java Streams: How to do an efficient “distinct and sort”?

猫巷女王i 2020-12-01 15:46

Let's assume I've got a Stream and want to get only its distinct elements, in sorted order.

The naïve approach would be to do just the following:

2 Answers
  • 2020-12-01 16:26

    Disclaimer: I know performance testing is hard, especially on the JVM, where it needs warm-up runs and a controlled environment with no other processes running.

    When I test it I get the results below, so it seems your implementation benefits from parallel execution (running on an i7 with 4 cores plus hyper-threading).

    So ".distinct().sorted()" seems to be slower, as predicted and explained by Holger.

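    Times in milliseconds; the columns follow the order in which the code below runs the pipelines.

    Round           sorted().filter(...)   sorted().distinct()   distinct().sorted()
    1 (warm-up?)    3938                   2449                  5747
    2               2834                   2620                  3984
    3 (parallel)    831                    4343                  6346
    4 (parallel)    825                    3309                  6339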
    

    Code used:

    package test.test;
    
    import java.util.Collections;
    import java.util.List;
    import java.util.Objects;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    
    public class SortDistinctTest {
    
        public static void main(String[] args) {
            IntStream range = IntStream.range(0, 6_000_000);
            List<Integer> collect = range.boxed().collect(Collectors.toList());
            Collections.shuffle(collect);
    
            long start = System.currentTimeMillis();
    
            System.out.println("Round 1 (Warm up?)");
            collect.stream().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
            long fst = System.currentTimeMillis();
            System.out.println(fst - start);
    
            collect.stream().sorted().distinct().collect(Collectors.counting());
            long snd = System.currentTimeMillis();
            System.out.println(snd - fst);
    
            collect.stream().distinct().sorted().collect(Collectors.counting());
            long end = System.currentTimeMillis();
            System.out.println(end - snd);
    
            System.out.println("Round 2");
            collect.stream().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
            fst = System.currentTimeMillis();
            System.out.println(fst - end);
    
            collect.stream().sorted().distinct().collect(Collectors.counting());
            snd = System.currentTimeMillis();
            System.out.println(snd - fst);
    
            collect.stream().distinct().sorted().collect(Collectors.counting());
            end = System.currentTimeMillis();
            System.out.println(end - snd);
    
            System.out.println("Round 3 Parallel");
            collect.stream().parallel().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
            fst = System.currentTimeMillis();
            System.out.println(fst - end);
    
            collect.stream().parallel().sorted().distinct().collect(Collectors.counting());
            snd = System.currentTimeMillis();
            System.out.println(snd - fst);
    
            collect.stream().parallel().distinct().sorted().collect(Collectors.counting());
            end = System.currentTimeMillis();
            System.out.println(end - snd);
    
            System.out.println("Round 4 Parallel");
            collect.stream().parallel().sorted().filter(noAdjacentDuplicatesFilter()).collect(Collectors.counting());
            fst = System.currentTimeMillis();
            System.out.println(fst - end);
    
            collect.stream().parallel().sorted().distinct().collect(Collectors.counting());
            snd = System.currentTimeMillis();
            System.out.println(snd - fst);
    
            collect.stream().parallel().distinct().sorted().collect(Collectors.counting());
            end = System.currentTimeMillis();
            System.out.println(end - snd);
    
        }
    
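        // Stateful predicate: remembers the previously seen element and rejects a value
        // equal to it, so it only removes duplicates when the elements arrive in sorted order.
        public static Predicate<Object> noAdjacentDuplicatesFilter() {
            final Object[] previousValue = { new Object() };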
    
            return value -> {
                final boolean takeValue = !Objects.equals(previousValue[0], value);
                previousValue[0] = value;
                return takeValue;
            };
    
        }
    
    }
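
    The test above repeats the same timing boilerplate for every pipeline. As a minimal sketch, the pattern could be factored into a helper based on System.nanoTime(), which is intended for measuring elapsed time. The class name TimingSketch and the helpers timeMillis and shuffledRange are hypothetical and not part of the original test, and the input is smaller so that it runs quickly. For numbers that can be trusted, a benchmark harness such as JMH would still be the safer choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class TimingSketch {

    // Hypothetical helper: runs the task once and returns the elapsed time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Builds the values 0..size-1 in random order, like the input of the test above.
    static List<Integer> shuffledRange(int size) {
        List<Integer> values = IntStream.range(0, size).boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        Collections.shuffle(values);
        return values;
    }

    public static void main(String[] args) {
        List<Integer> data = shuffledRange(200_000);
        // Early rounds mostly warm up the JIT; later rounds are more representative.
        for (int round = 1; round <= 3; round++) {
            long sortedDistinct = timeMillis(
                    () -> data.stream().sorted().distinct().collect(Collectors.counting()));
            long distinctSorted = timeMillis(
                    () -> data.stream().distinct().sorted().collect(Collectors.counting()));
            System.out.printf("Round %d: sorted().distinct() %d ms, distinct().sorted() %d ms%n",
                    round, sortedDistinct, distinctSorted);
        }
    }
}
```

    Timing each pipeline through the same helper keeps the measurements independent of the order of the surrounding statements, which the chained fst/snd/end variables above do not.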
    
  • 2020-12-01 16:38

    When you chain a distinct() operation directly after sorted(), the implementation will utilize the sorted nature of the data and avoid building an internal HashSet, which can be demonstrated by the following program:

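    import java.util.stream.Stream;

    public class DistinctAndSort {
        // Call counters, reset between the two runs in main().
        static int COMPARE, EQUALS, HASHCODE;
        static class Tracker implements Comparable<Tracker> {
            static int SERIAL;
            int id;
            Tracker() {
                // Consecutive instances share an id in pairs: 0, 0, 1, 1, ...
                id=SERIAL++/2;
            }
            @Override
            public int compareTo(Tracker o) {
                COMPARE++;
                return Integer.compare(id, o.id);
            }
            @Override
            public int hashCode() {
                HASHCODE++;
                return id;
            }
            @Override
            public boolean equals(Object obj) {
                EQUALS++;
                // Identity comparison, so no two instances are ever equal.
                return super.equals(obj);
            }
        }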
        public static void main(String[] args) {
            System.out.println("adjacent sorted() and distinct()");
            Stream.generate(Tracker::new).limit(100)
                  .sorted().distinct()
                  .forEachOrdered(o -> {});
            System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                              COMPARE, EQUALS, HASHCODE);
            COMPARE=EQUALS=HASHCODE=0;
            System.out.println("now with intermediate operation");
            Stream.generate(Tracker::new).limit(100)
                .sorted().map(x -> x).distinct()
                .forEachOrdered(o -> {});
            System.out.printf("compareTo: %d, EQUALS: %d, HASHCODE: %d%n",
                              COMPARE, EQUALS, HASHCODE);
        }
    }
    

    which will print:

    adjacent sorted() and distinct()
    compareTo: 99, EQUALS: 99, HASHCODE: 0
    now with intermediate operation
    compareTo: 99, EQUALS: 100, HASHCODE: 200
    

    An intermediate operation, even one as simple as map(x -> x), can’t be recognized by the Stream implementation; hence, it must assume that the elements might not be sorted with respect to the mapping function’s result.

    There is no guarantee that this kind of optimization happens. However, it is reasonable to assume that the developers of the Stream implementation will not remove it and may even add more optimizations, so rolling your own implementation will prevent your code from benefiting from future optimizations.

    Further, what you have created is a “stateful predicate”, which is strongly discouraged and, of course, will break when used with a parallel stream.
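
    To make the concern concrete: a filter that keeps its previous element in shared mutable state may, in a parallel stream, compare a value against something written by another thread. A minimal sketch that sticks to the built-in operations instead, which should give the same result in sequential and parallel mode, might look like this. The class name ParallelSafeDistinct and the method distinctSorted are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelSafeDistinct {

    // Uses only the built-in sorted() and distinct() operations, so no
    // user-supplied lambda holds mutable state shared between threads.
    static List<Integer> distinctSorted(List<Integer> input, boolean parallel) {
        Stream<Integer> stream = parallel ? input.parallelStream() : input.stream();
        return stream.sorted().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 3, 5, 1, 3, 2, 1);
        System.out.println(distinctSorted(input, false)); // prints [1, 2, 3, 5]
        System.out.println(distinctSorted(input, true));  // prints [1, 2, 3, 5]
    }
}
```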

    If you don’t trust the Stream API to perform this operation efficiently enough, you might be better off implementing this particular operation without the Stream API.
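
    As a rough sketch of what such a Stream-free implementation could look like, here are two variants. The class and method names are hypothetical. Both assume non-null elements with a natural ordering, and both treat elements as duplicates when compareTo returns 0, which is not the same as the equals-based definition used by Stream.distinct().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

class DistinctSortedWithoutStreams {

    // Variant 1: a TreeSet keeps its elements unique and in natural order.
    static <T extends Comparable<? super T>> List<T> viaTreeSet(Collection<T> input) {
        TreeSet<T> set = new TreeSet<>(input);
        return new ArrayList<>(set);
    }

    // Variant 2: sort a copy, then skip every element that compares equal to its predecessor.
    static <T extends Comparable<? super T>> List<T> viaSortAndScan(Collection<T> input) {
        List<T> sorted = new ArrayList<>(input);
        Collections.sort(sorted);
        List<T> result = new ArrayList<>();
        T previous = null;
        for (T value : sorted) {
            // The first element is always kept; later ones only if they differ from the previous one.
            if (result.isEmpty() || previous.compareTo(value) != 0) {
                result.add(value);
            }
            previous = value;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 1, 2, 3, 1);
        System.out.println(viaTreeSet(input));     // prints [1, 2, 3]
        System.out.println(viaSortAndScan(input)); // prints [1, 2, 3]
    }
}
```

    The second variant is the same idea as the adjacent-duplicates filter, but the state lives in a local variable of a plain loop, so it never meets a parallel pipeline.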
