Creating distinct list from existing list in Java 7 and 8?

主宰稳场 提交于 2019-12-02 22:28:04
Matt Coubrough

I've now MicroBenchmarked most of the proposed options from the excellent answers provided. Like most non-trivial performance related questions the answer as to which is best is "it depends".

All my testing was performed with JMH Java Microbenchmarking Harness.

Most of these tests were performed using JDK 1.8, although I performed some of the tests with JDK 1.7 too just to ensure that its performance wasn't too different (it was almost identical). I tested the following techniques taken from the answers supplied so far:


1. Java 8 Stream - The solution using stream() I had proprosed as a possibility if using Java8:

public List<Integer> testJava8Stream(List<Integer> listInts) {
    return listInts.stream().distinct().collect(Collectors.toList());
}

pros modern Java 8 approach, no 3rd party dependencies

cons Requires Java 8


2. Adding To List - The solution proposed by Victor2748 where a new list is constructed and added to, if and only if the list doesn't already contain the value. Note that I also preallocate the destination list at the size of the original (the max possible) to prevent any reallocations:

public List<Integer> testAddingToList(List<Integer> listInts) {
    List<Integer> listDistinctInts = new ArrayList<>(listInts.size());
    for(Integer i : listInts)
    {
        if( !listDistinctInts.contains(i) ) { listDistinctInts.add(i); }
    }
    return listDistinctInts;
}

pros Works in any Java version, no need to create a Set and then copy, no 3rd party deps

cons Needs to repeatedly check the List for existing values as we build it


3. GS Collections Fast (now Eclipse collections) - The solution proposed by Craig P. Motlin using the GS Collections library and their custom List type FastList:

public List<Integer> testGsCollectionsFast(FastList listFast)
{
    return listFast.distinct();
}

pros Reportedly very quick, simple expressive code, works in Java 7 and 8

cons Requires 3rd party library and a FastList rather than a regular List<Integer>


4. GS Collections Adapted - The FastList solution wasn't quite comparing like-for-like because it needed a FastList passed to the method rather than a good ol' ArrayList<Integer> so I also tested the adapter method Craig proposed:

public List<Integer> testGsCollectionsAdapted(List<Integer> listInts)
{
    return listAdapter.adapt(listInts).distinct();
}

pros Doesn't require a FastList, works in Java 7 and 8

cons Has to adapt List so may not perform as well, needs 3rd party library


5. Guava ImmutableSet - The method proposed by Louis Wasserman in comments, and by 卢声远 Shengyuan Lu in their answer using Guava:

public List<Integer> testGuavaImmutable(List<Integer> listInts)
{
    return ImmutableSet.copyOf(listInts).asList();
}

pros Reportedly very fast, works in Java 7 or 8

cons Returns an Immutable List, can't handle nulls in the input List, and requires 3rd party library


7. HashSet - My original idea (also recommended by EverV0id, ulix and Radiodef)

public List<Integer> testHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new HashSet<Integer>(listInts));
}

pros Works in Java 7 and 8, no 3rd party dependencies

cons Doesn't retain original order of list, has to construct set then copy to list.


6. LinkedHashSet - Since the HashSet solution didn't preserve the order of the Integers in the original list I also tested a version which uses LinkedHashSet to preserve order:

public List<Integer> testLinkedHashSet(List<Integer> listInts)
{
    return new ArrayList<Integer>(new LinkedHashSet<Integer>(listInts));
}

pros Retains original ordering, works in Java 7 and 8, no 3rd party dependencies

cons Unlikely to be as fast as regular HashSet approach


Results

Here are my results for various different sizes of listInts (results ordered from slowest to fastest):

1. taking distinct from ArrayList of 100,000 random ints between 0-50,000 (ie. big list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10        0.505        0.012    ops/s
Java8Stream             thrpt        10      234.932       31.959    ops/s
LinkedHashSet           thrpt        10      262.185       16.679    ops/s
HashSet                 thrpt        10      264.295       24.154    ops/s
GsCollectionsAdapted    thrpt        10      357.998       18.468    ops/s
GsCollectionsFast       thrpt        10      363.443       40.089    ops/s
GuavaImmutable          thrpt        10      469.423       26.056    ops/s

2. taking distinct from ArrayList of 1000 random ints between 0-50 (ie. medium list, many duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10    32794.698     1154.113    ops/s
HashSet                 thrpt        10    61622.073     2752.557    ops/s
LinkedHashSet           thrpt        10    67155.865     1690.119    ops/s
Java8Stream             thrpt        10    87440.902    13517.925    ops/s
GsCollectionsFast       thrpt        10   103490.738    35302.201    ops/s
GsCollectionsAdapted    thrpt        10   143135.973     4733.601    ops/s
GuavaImmutable          thrpt        10   186301.330    13421.850    ops/s

3. taking distinct from ArrayList of 100 random ints between 0-100 (ie. small list, some duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

AddingToList            thrpt        10   278435.085    14229.285    ops/s
Java8Stream             thrpt        10   397664.052    24282.858    ops/s
LinkedHashSet           thrpt        10   462701.618    20098.435    ops/s
GsCollectionsAdapted    thrpt        10   477097.125    15212.580    ops/s
GsCollectionsFast       thrpt        10   511248.923    48155.211    ops/s
HashSet                 thrpt        10   512003.713    25886.696    ops/s
GuavaImmutable          thrpt        10  1082006.560    18716.012    ops/s

4. taking distinct from ArrayList of 10 random ints between 0-50 (ie. tiny list, few duplicates)

Benchmark                Mode       Samples     Mean   Mean error    Units

Java8Stream             thrpt        10  2739774.758   306124.297    ops/s
LinkedHashSet           thrpt        10  3607479.332   150331.918    ops/s
HashSet                 thrpt        10  4238393.657   185624.358    ops/s
GsCollectionsAdapted    thrpt        10  5919254.755   495444.800    ops/s
GsCollectionsFast       thrpt        10  7916079.963  1708778.450    ops/s
AddingToList            thrpt        10  7931479.667   966331.036    ops/s
GuavaImmutable          thrpt        10  9021621.880   845936.861    ops/s

Conclusions

  • If you're only taking the distinct items from a list once, and the list isn't very long any of these methods should be adequate.

  • The most efficient general approaches came from the 3rd party libraries: GS Collections and Guava performed admirably.

  • You may need to consider the size of your list and the likely number of duplicates when selecting the most performant method.

  • The naive approach of adding to a new list only if the value isn't already in it works great for tiny lists, but as soon as you have more than a handful of values in the input list it performs the worst of the methods tried.

  • The Guava ImmutableSet.copyOf(listInts).asList() method works the fastest in most situations. But take note of the restrictions: the returned list is Immutable and the input list cannot contain nulls.

  • The HashSet method performs the best of the non 3rd party approaches and usually better than Java 8 streams, but reorders the integers (which may or may not be an issue depending on your use-case).

  • The LinkedHashSet approach keeps the ordering but unsurprisingly was usually worse than the HashSet method.

  • Both the HashSet and LinkedHashSet methods will perform worse when using lists of data types that have complex HashCode calculations, so do your own profiling if you're trying to select distinct Foos from a List<Foo>.

  • If you already have GS Collections as a dependency then it performs very well and is more flexible than the ImmutableList Guava approach. If you don't have it as a dependency, it's worth considering adding it if the performance of selecting distinct items is critical to the performance of your application.

  • Disappointingly Java 8 streams seemed to perform fairly poorly. There may be a better way to code the distinct() call than the way I used, so comments or other answers are of course welcome.

NB. I'm no expert at MicroBenchmarking, so if anyone finds flaws in my results or methodology please notify me and I'll endeavour to correct the Answer.

Craig P. Motlin

If you're using Eclipse Collections (formerly GS Collections), you can use the method distinct().

ListIterable<Integer> listInts = FastList.newListWith(1, 1, 3, 77, 2, 19, 77, 123, 14, 123);
Assert.assertEquals(
        FastList.newListWith(1, 3, 77, 2, 19, 123, 14),
        listInts.distinct());

The advantage of using distinct() instead of converting to a Set and then back to a List is that distinct() preserves the order of the original List, retaining the first occurrence of each element. It's implemented by using both a Set and a List.

MutableSet<T> seenSoFar = UnifiedSet.newSet();
int size = list.size();
for (int i = 0; i < size; i++)
{
    T item = list.get(i);
    if (seenSoFar.add(item))
    {
        targetCollection.add(item);
    }
}
return targetCollection;

If you cannot convert your original List into a GS Collections type, you can use ListAdapter to get the same API.

MutableList<Integer> distinct = ListAdapter.adapt(integers).distinct();

There's no way to avoid the creation of the Set. Still, UnifiedSet is more efficient than HashSet so there will be some speed benefit.

If all you want is the number of distinct items, it's more efficient to just create a set without creating the list.

Verify.assertSize(7, UnifiedSet.newSet(listInts));

Eclipse Collections 8.0 requires Java 8. Eclipse Collections 7.x works well with Java 8, but only requires Java 5.

Note: I am a committer for Eclipse collections.

You should try new LinkedList(new HashSet(listInts)).

Guava can be your choice:

ImmutableSet<Integer> set = ImmutableSet.copyOf(listInts);

The API is extremely optimized.

It is FASTER than listInts.stream().distinct() and new LinkedHashSet<>(listInts).

When adding a value to a listInts check:

int valueToAdd;
//...
if (!listInts.contains(valueToAdd)) {listInts.add(valueToAdd)}

if you have an existing list, use a for-each statement to copy all the values from that list, to a new one, that you want to be "distinct":

List<Integer> listWithRepeatedValues;
List<Integer> distinctList;
//...
for (Integer i : listWithRepeatedValues) {
    if (!listInts.contains(valueToAdd)) {distinctList.add(i);}
}

Don't worry. Using a HashSet is a pretty easy and efficient way to eliminate duplicates:

    Set<Integer> uniqueList = new HashSet<>();
    uniqueList.addAll(listInts);   // Add all elements eliminating duplicates

    for (int n : uniqueList)       // Check the results (in no particular order)
        System.out.println(n);

    System.out.println("Number distinct values: " + uniqueList.size());

In a more specific scenario, just in case the range of possible values is known, is not very large, while listInts is very large.
The most efficient way of counting the number of unique entries in the list that I can think of is:

    boolean[] counterTable = new boolean[124];
    int counter = 0;

    for (int n : listInts)
        if (!counterTable[n]) {
            counter++;
            counterTable[n] = true;
        }

    System.out.println("Number of distinct values: " + counter);

This should work:

yourlist.stream().map(your wrapper that overrides equals and hashchode method::new).distinct().map(wrapper defined above::method that returns the final output).collect(Collectors.toList());

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!