Example: List 1: [1, 4, 5, 8, 9]
List 2: [3, 4, 4, 6]
List 3: [0, 2, 8]
Would yield the following result:
Iterator -> [0, 1, 2, 3, 4, 4, 4, 5, 6, 8, 8, 9]
There are basically three different ways to merge multiple sorted lists: successive pairwise merging, divide and conquer, and a priority queue based merge.
In the discussion below, n refers to the total number of items in all lists combined, and k refers to the number of lists.
Case 1, successive pairwise merging, is the easiest to envision, but also the least efficient. Imagine you're given four lists, A, B, C, and D. With this method, you merge A and B to create AB. Then you merge AB and C to create ABC. Finally, you merge ABC with D to create ABCD. You iterate over the items of A and B three times, C two times, and D one time, because every later merge walks through everything merged so far. The complexity of this algorithm therefore approaches O(n*k).
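The successive merge described above can be sketched roughly as follows. This is a minimal illustration that works on in-memory lists rather than iterators; the class name SuccessiveMerge and the helpers mergeTwo and mergeAll are hypothetical names, not code from the blog entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of case 1: fold the lists together one at a time.
final class SuccessiveMerge {
    // Standard two-way merge of two sorted lists into a new list.
    static <E> List<E> mergeTwo(List<E> a, List<E> b, Comparator<? super E> cmp) {
        List<E> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            // On ties, take the item from the left list first.
            if (cmp.compare(a.get(i), b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // Merges A and B, then AB and C, then ABC and D, and so on.
    // The first pass merges an empty list with A, which just copies A.
    static <E> List<E> mergeAll(List<List<E>> lists, Comparator<? super E> cmp) {
        List<E> result = new ArrayList<>();
        for (List<E> list : lists) {
            result = mergeTwo(result, list, cmp);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = Arrays.asList(
            Arrays.asList(1, 4, 5, 8, 9),
            Arrays.asList(3, 4, 4, 6),
            Arrays.asList(0, 2, 8));
        // prints [0, 1, 2, 3, 4, 4, 4, 5, 6, 8, 8, 9]
        System.out.println(mergeAll(lists, Comparator.<Integer>naturalOrder()));
    }
}
```

Each call to mergeTwo re-reads the whole accumulated result, which is where the repeated passes over the early lists come from.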
The divide and conquer solution is to merge A and B to create AB. Then merge C and D to create CD. Then merge AB and CD to create ABCD. In the best case, which occurs when the lists have similar numbers of items, this method is O(n * log(k)). But if the lists' lengths vary widely, this algorithm's running time can approach O(n*k).
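One way to sketch the divide and conquer order is with a FIFO queue of lists: take two lists from the front, merge them, and put the result at the back until one list remains. This is only an illustration on in-memory lists under that assumption; the class name QueueMerge and its methods are hypothetical and not taken from the linked blog entries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the divide and conquer merge using a FIFO queue.
final class QueueMerge {
    // Standard two-way merge of two sorted lists into a new list.
    static <E> List<E> mergeTwo(List<E> a, List<E> b, Comparator<? super E> cmp) {
        List<E> out = new ArrayList<>(a.size() + b.size());
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (cmp.compare(a.get(i), b.get(j)) <= 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }

    // With four lists the queue evolves as
    // [A, B, C, D] -> [C, D, AB] -> [AB, CD] -> [ABCD].
    static <E> List<E> mergeAll(List<List<E>> lists, Comparator<? super E> cmp) {
        Deque<List<E>> queue = new ArrayDeque<>(lists);
        if (queue.isEmpty()) {
            return new ArrayList<>();
        }
        while (queue.size() > 1) {
            List<E> first = queue.removeFirst();
            List<E> second = queue.removeFirst();
            queue.addLast(mergeTwo(first, second, cmp));
        }
        return queue.removeFirst();
    }
}
```

Note that with a single input list this sketch returns that list itself rather than a copy.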
For more information about these two algorithms, see my blog entry, A closer look at pairwise merging. For more details about the divide and conquer approach specifically, see A different way to merge multiple lists.
The priority queue based merge works as follows:
    Create a priority queue to hold the iterator for each list
    while the priority queue is not empty
        Remove the iterator that references the smallest current number
        Output the referenced value
        Advance the iterator
        If not at end of iterator
            Add the iterator back to the queue
This algorithm is proven to be O(n * log(k)) in the worst case. You can see that every item in every list is added to the priority queue exactly once, and removed from the priority queue exactly once. The queue never contains more than k items at any time, so each insertion or removal costs O(log(k)), and the memory requirements are very small.
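If the inputs happen to be random-access lists rather than iterators, the same idea can be sketched without any helper classes by storing small cursors in the queue. This is a simplified variant for illustration only; the class name IndexHeapMerge and the {listIndex, position} cursor layout are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical index-based version of the priority queue merge.
final class IndexHeapMerge {
    static <E> List<E> mergeAll(List<List<E>> lists, Comparator<? super E> cmp) {
        // Each queue entry is a cursor {listIndex, position}; cursors are
        // ordered by the item they currently point to.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Math.max(1, lists.size()),
            (x, y) -> cmp.compare(lists.get(x[0]).get(x[1]),
                                  lists.get(y[0]).get(y[1])));
        int total = 0;
        for (int i = 0; i < lists.size(); i++) {
            total += lists.get(i).size();
            if (!lists.get(i).isEmpty()) {
                pq.add(new int[] {i, 0});
            }
        }
        List<E> out = new ArrayList<>(total);
        while (!pq.isEmpty()) {
            int[] cursor = pq.poll();            // smallest current item
            List<E> source = lists.get(cursor[0]);
            out.add(source.get(cursor[1]));      // output it
            cursor[1]++;                         // advance the cursor
            if (cursor[1] < source.size()) {
                pq.add(cursor);                  // re-queue if not exhausted
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> lists = Arrays.asList(
            Arrays.asList(1, 4, 5, 8, 9),
            Arrays.asList(3, 4, 4, 6),
            Arrays.asList(0, 2, 8));
        // prints [0, 1, 2, 3, 4, 4, 4, 5, 6, 8, 8, 9]
        System.out.println(mergeAll(lists, Comparator.<Integer>naturalOrder()));
    }
}
```

The queue holds at most one cursor per list, which matches the k-item bound above. Unlike an iterator-based version, this sketch collects the whole result in memory instead of producing it lazily.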
The implementation of iterators in Java makes the priority queue implementation slightly inconvenient, but it's easily fixed with some helper classes. Most importantly, we need an iterator that lets us peek at the next item without consuming it. I call this a PeekableIterator, which looks like this:
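import java.util.Iterator;
import java.util.NoSuchElementException;

// PeekableIterator is an iterator that lets us peek at the next item
// without consuming it.
public class PeekableIterator<E> implements Iterator<E> {
    private final Iterator<E> iterator;
    private E current;
    private boolean hasCurrent;

    public PeekableIterator(Iterator<E> iterator) {
        this.iterator = iterator;
        // read one item ahead so that getCurrent() can return it
        if (iterator.hasNext()) {
            current = iterator.next();
            hasCurrent = true;
        }
        else {
            hasCurrent = false;
        }
    }

    // Returns the next item without consuming it.
    public E getCurrent() {
        if (!hasCurrent) {
            throw new NoSuchElementException();
        }
        return current;
    }

    public boolean hasNext() {
        return hasCurrent;
    }

    // Returns the current item and reads ahead to the following one.
    public E next() {
        if (!hasCurrent) {
            throw new NoSuchElementException();
        }
        E rslt = current;
        if (iterator.hasNext()) {
            current = iterator.next();
        }
        else {
            current = null;
            hasCurrent = false;
        }
        return rslt;
    }

    public void remove() {
        // The underlying iterator has already read one item ahead, so
        // delegating to iterator.remove() would remove the wrong item.
        throw new UnsupportedOperationException();
    }
}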
Then, since the priority queue will hold iterators rather than individual items, we need a comparator that will compare the current items of two PeekableIterator instances. That's easy enough to create:
import java.util.Comparator;

// IteratorComparator lets us compare the next items for two PeekableIterator instances.
public class IteratorComparator<E> implements Comparator<PeekableIterator<E>> {
    private final Comparator<E> comparator;

    public IteratorComparator(Comparator<E> comparator) {
        this.comparator = comparator;
    }

    public int compare(PeekableIterator<E> t1, PeekableIterator<E> t2) {
        // both iterators must have a current item; MergeIterator guarantees that
        return comparator.compare(t1.getCurrent(), t2.getCurrent());
    }
}
Those two classes are more formal implementations of the code you wrote to get and compare the next items for individual iterators.
Finally, the MergeIterator initializes a PriorityQueue<PeekableIterator> so that you can call the hasNext and next methods to iterate over the merged lists:
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// MergeIterator merges items from multiple sorted iterators
// to produce a single sorted sequence.
public class MergeIterator<E> implements Iterator<E> {
    private final IteratorComparator<E> comparator;
    private final PriorityQueue<PeekableIterator<E>> pqueue;

    // call with a list of sorted sequences to merge
    public MergeIterator(List<Iterator<E>> iterators, Comparator<E> comparator) {
        this.comparator = new IteratorComparator<E>(comparator);
        // Initial capacity is set to 11 because that's the default. Before Java 8
        // there was no constructor that took a comparator without a capacity.
        pqueue = new PriorityQueue<PeekableIterator<E>>(11, this.comparator);
        // add iterators to the priority queue
        for (Iterator<E> iterator : iterators) {
            // but only if the iterator actually has items
            if (iterator.hasNext()) {
                pqueue.offer(new PeekableIterator<E>(iterator));
            }
        }
    }

    public boolean hasNext() {
        return !pqueue.isEmpty();
    }

    public E next() {
        // take the iterator whose current item is smallest
        PeekableIterator<E> iterator = pqueue.poll();
        if (iterator == null) {
            throw new NoSuchElementException();
        }
        E rslt = iterator.next();
        // put it back only if it still has items to contribute
        if (iterator.hasNext()) {
            pqueue.offer(iterator);
        }
        return rslt;
    }

    public void remove() {
        // removing items from the source lists is not supported
        throw new UnsupportedOperationException();
    }
}
I've created a little test program to demonstrate how this works:
// requires java.util.ArrayList, java.util.Arrays, and java.util.Iterator
private void DoIt() {
    String[] a1 = new String[] {"apple", "cherry", "grape", "peach", "strawberry"};
    String[] a2 = new String[] {"banana", "fig", "orange"};
    String[] a3 = new String[] {"cherry", "kumquat", "pear", "pineapple"};

    // create an ArrayList of iterators that we can pass to the
    // MergeIterator constructor.
    ArrayList<Iterator<String>> iterators = new ArrayList<Iterator<String>> (
        Arrays.asList(
            Arrays.asList(a1).iterator(),
            Arrays.asList(a2).iterator(),
            Arrays.asList(a3).iterator())
    );

    // String.CASE_INSENSITIVE_ORDER is a ready-made String comparator
    // that ignores case; it has been part of String since Java 1.2.
    // For a case-sensitive merge on Java 8 or later,
    // Comparator.naturalOrder() also works.
    MergeIterator<String> merger = new MergeIterator<String>(iterators, String.CASE_INSENSITIVE_ORDER);
    while (merger.hasNext()) {
        String s = merger.next();
        System.out.println(s);
    }
}
My performance comparisons of the divide-and-conquer and priority queue merges show that the divide-and-conquer approach can be faster than the priority queue, depending on the cost of comparisons. When comparisons are cheap (primitive types, for example), the pairwise merge is faster even though it does more work. As key comparisons become more expensive (comparing strings, for example), the priority queue merge has the advantage because it performs fewer comparisons.
More importantly, the pairwise merge requires twice the memory of the priority queue approach. My implementation used a FIFO queue, but even if I built a tree, the pairwise merge would require more memory. Also, as your code shows, you still need the PeekableIterator and IteratorComparator classes (or something similar) if you want to implement the pairwise merge.
See Testing merge performance for more details about the relative performance of these two methods.
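If you want to check the comparison counts on your own data, one simple approach is to wrap the comparator in a counter before handing it to the merge. The CountingComparator below is a hypothetical helper written for this illustration; it is not the instrumentation used in the linked tests.

```java
import java.util.Comparator;

// Hypothetical helper: counts how many times compare() is called.
final class CountingComparator<E> implements Comparator<E> {
    private final Comparator<? super E> inner;
    private long count;

    CountingComparator(Comparator<? super E> inner) {
        this.inner = inner;
    }

    @Override
    public int compare(E a, E b) {
        count++;                       // record the call
        return inner.compare(a, b);    // then delegate to the real comparator
    }

    long getCount() {
        return count;
    }
}
```

For example, passing new CountingComparator<String>(String.CASE_INSENSITIVE_ORDER) to the MergeIterator constructor and reading getCount() after the loop finishes would give the number of comparisons the priority queue merge performed on that input.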
For the reasons I detailed above, I conclude that the priority queue merge is the best way to go.