Sorting a List in parallel without creating a temporary array in Java 8

后端 未结 5 805
小鲜肉
小鲜肉 2020-12-03 13:13

Java 8 provides java.util.Arrays.parallelSort, which sorts arrays in parallel using the fork-join framework. But there\'s no corresponding Collections.parallelSort

相关标签:
5条回答
  • 2020-12-03 13:44

    By combining the existing answers I came up with this code.
    This works if you are not interested in creating a custom List class and if you don't bother to create a temporary array (Collections.sort is doing it anyway).
    This uses the initial list and does not create a new one as in the parallelStream solution.

    // Convert List to Array so we can use Arrays.parallelSort rather than Collections.sort.
    // Note that Collections.sort begins with this very same conversion, so we're not adding overhead
    // in comparaison with Collections.sort.
    Foo[] fooArr = fooLst.toArray(new Foo[0]);
    
    // Multithread the TimSort. Automatically fallback to mono-thread when size is less than 8192.
    Arrays.parallelSort(fooArr, Comparator.comparingStuff(Foo::yourmethod));
    
    // Refill the List using the sorted Array, the same way Collections.sort does it.
    ListIterator<Foo> i = fooLst.listIterator();
    for (Foo e : fooArr) {
        i.next();
        i.set((Foo) e);
    }
    
    0 讨论(0)
  • 2020-12-03 13:52

    Just speculating here, but I see several good reasons for generic sort algorithms preferring to work on arrays instead of List instances:

    • Element access is performed via method calls. Despite all the optimizations JIT can apply, even for a list that implements RandomAccess, this probably means a lot of overhead compared to plain array accesses which can be optimized very well.
    • Many algorithms require copying some fragments of the array to temporary structures. There are efficient methods for copying arrays or their fragments. An arbitrary List instance on the other hand, can't be easily copied. New lists would have to be allocated which poses two problems. First, this means allocating some new objects which is likely more costly than allocating arrays. Second, the algorithm would have to choose what implementation of List should be allocated for this temporary structure. There are two obvious solutions, both bad: either just choose some hard-coded implementation, e.g. ArrayList, but then it could just allocate simple arrays as well (and if we're generating arrays then it's much easier if the soiurce is also an array). Or, let the user provide some list factory object, which makes the code much more complicated.
    • Related to the previous issue: there is no obvious way of copying a list into another due to how the API is designed. The best the List interface offers is addAll() method, but this is probably not efficient for most cases (think of pre-allocating the new list to its target size vs adding elements one by one which many implementations do).
    • Most lists that need to be sorted will be small enough for another copy to not be an issue.

    So probably the designers thought of CPU efficiency and code simplicity most of all, and this is easily achieved when the API accepts arrays. Some languages, e.g. Scala, have sort methods that work directly on lists, but this comes at a cost and probably is less efficient than sorting arrays in many cases (or sometimes there will probably just be a conversion to and from array performed behind the scenes).

    0 讨论(0)
  • 2020-12-03 13:54

    There doesn't appear to be any straightforward way to sort a List in parallel in Java 8. I don't think this is fundamentally difficult; it looks more like an oversight to me.

    The difficulty with a hypothetical Collections.parallelSort(list, cmp) is that the Collections implementation knows nothing about the list's implementation or its internal organization. This can be seen by examining the Java 7 implementation of Collections.sort(list, cmp). As you observed, it has to copy the list elements out to an array, sort them, and then copy them back into the list.

    This is the big advantage of the List.sort(cmp) extension method over Collections.sort(list, cmp). It might seem that this is merely a small syntactic advantage being able to write myList.sort(cmp) instead of Collections.sort(myList, cmp). The difference is that myList.sort(cmp), being an interface extension method, can be overridden by the specific List implementation. For example, ArrayList.sort(cmp) sorts the list in-place using Arrays.sort() whereas the default implementation implements the old copyout-sort-copyback technique.

    It should be possible to add a parallelSort extension method to the List interface that has similar semantics to List.sort but does the sorting in parallel. This would allow ArrayList to do a straightforward in-place sort using Arrays.parallelSort. (It's not entirely clear to me what the default implementation should do. It might still be worth it to do copyout-parallelSort-copyback.) Since this would be an API change, it can't happen until the next major release of Java SE.

    As for a Java 8 solution, there are a couple workarounds, none very pretty (as is typical of workarounds). You could create your own array-based List implementation and override sort() to sort in parallel. Or you could subclass ArrayList, override sort(), grab the elementData array via reflection and call parallelSort() on it. Of course you could just write your own List implementation and provide a parallelSort() method, but the advantage of overriding List.sort() is that this works on the plain List interface and you don't have to modify all the code in your code base to use a different List subclass.

    0 讨论(0)
  • 2020-12-03 13:54

    Use the following:

    yourCollection.parallelStream().sorted().collect(Collectors.toList());
    

    This will be parallel when sorting, because of parallelStream(). I believe this is what you mean by parallel sort?

    0 讨论(0)
  • 2020-12-03 14:06

    I think you are doomed to use a custom List implementation augmented with your own parallelSort or else change all your other code to store the big data in Array types.

    This is the inherent problem with layers of abstract data types. They're meant to isolate the programmer from details of implementation. But when the details of implementation matter - as in the case of underlying storage model for sort - the otherwise splendid isolation leaves the programmer helpless.

    The standard List sort documents provide an example. After the explanation that mergesort is used, they say

    The default implementation obtains an array containing all elements in this list, sorts the array, and iterates over this list resetting each element from the corresponding position in the array. (This avoids the n2 log(n) performance that would result from attempting to sort a linked list in place.)

    In other words, "since we don't know the underlying storage model for a List and couldn't touch it if we did, we make a copy organized in a known way." The parenthesized expression is based on the fact that the List "i'th element accessor" on a linked list is Omega(n), so the normal array mergesort implemented with it would be a disaster. In fact it's easy to implement mergesort efficiently on linked lists. The List implementer is just prevented from doing it.

    A parallel sort on List has the same problem. The standard sequential sort fixes it with custom sorts in the concrete List implementations. The Java folks just haven't chosen to go there yet. Maybe in Java 9.

    0 讨论(0)
提交回复
热议问题