I read the following in a forum:
Merge sort is very efficient for immutable data structures like linked lists
and
Quick sort works well for sorting in-place. In particular, most of the operations can be defined in terms of swapping pairs of elements in an array. To do that, however, you normally "walk" through the array with two pointers (or indexes, etc.). One starts at the beginning of the array and the other at the end. Both then work their way toward the middle (and you're done with a particular partition step when they meet). That's expensive with files, because files are oriented primarily toward reading in one direction, from beginning to end. Starting from the end and seeking backwards is usually relatively expensive.
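Here's a minimal sketch of that two-pointer partition in Python (a Hoare-style scheme; the function names and middle-element pivot are illustrative choices, not taken from any particular library):

    def partition(a, lo, hi):
        # Two indexes walk toward the middle, swapping out-of-place pairs.
        pivot = a[(lo + hi) // 2]
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            while a[i] < pivot:      # left pointer walks right...
                i += 1
            j -= 1
            while a[j] > pivot:      # ...right pointer walks left
                j -= 1
            if i >= j:               # they met: this partition step is done
                return j
            a[i], a[j] = a[j], a[i]  # swap the out-of-place pair

    def quicksort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo < hi:
            p = partition(a, lo, hi)
            quicksort(a, lo, p)
            quicksort(a, p + 1, hi)

Notice the backwards movement of j: trivial on an array in memory, but exactly the seek pattern that's painful on a file.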
At least in its simplest incarnation, merge sort is pretty much the opposite. The easy way to implement it only requires looking through the data in one direction, but involves breaking the data into two separate pieces, sorting the pieces, then merging them back together.
With a linked list, it's easy to take (for example) alternating elements in one linked list, and manipulate the links to create two linked lists from those same elements instead. With an array, rearranging elements so alternating elements go into separate arrays is easy if you're willing to create a copy as big as the original data, but otherwise decidedly non-trivial.
Likewise, merging with arrays is easy if you merge elements from the source arrays into a new array with the data in order -- but to do it in place without creating a whole new copy of the data is a whole different story. With a linked list, merging elements together from two source lists into a single target list is trivial -- again, you just manipulate links, without copying elements.
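To make the link manipulation concrete, here's a sketch in Python (the Node class and function names are invented for illustration). Note that neither splitting nor merging copies an element; both just rewrite next pointers:

    class Node:
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def split_alternating(head):
        # Distribute alternating nodes into two lists by relinking.
        # Prepending reverses each half, which is harmless here: both
        # halves are about to be sorted anyway.
        a = b = None
        take_a = True
        while head is not None:
            nxt = head.next
            if take_a:
                head.next, a = a, head
            else:
                head.next, b = b, head
            take_a = not take_a
            head = nxt
        return a, b

    def merge(a, b):
        # Merge two sorted lists into one, purely by rewriting links.
        dummy = tail = Node(None)
        while a and b:
            if a.value <= b.value:
                tail.next, a = a, a.next
            else:
                tail.next, b = b, b.next
            tail = tail.next
        tail.next = a if a else b
        return dummy.next

    def merge_sort(head):
        if head is None or head.next is None:
            return head
        a, b = split_alternating(head)
        return merge(merge_sort(a), merge_sort(b))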
As for using Quicksort to produce the sorted runs for an external merge sort, it does work, but it's (decidedly) sub-optimal as a rule. To optimize a merge-sort, you normally want to maximize the lengths of each sorted "run" as you produce it. If you simply read in the data that will fit in memory, Quicksort it and write it out, each run will be restricted to (a little less than) the size of the available memory.
You can do quite a bit better than that as a rule though. You start by reading in a block of data, but instead of using a Quicksort on it, you build a heap. Then, as you write each item out from the heap into the sorted "run" file, you read another item in from your input file. If it's larger than the item you just wrote to disk, you insert it into your existing heap, and repeat.
Items that are smaller (i.e., belong before items that have already been written) you keep separate, and build into a second heap. When (and only when) your first heap is empty, and the second heap has taken over all the memory, you quit writing items to the existing "run" file, and start on a new one.
Exactly how effective this will be depends on the initial order of the data. In the worst case (input sorted in inverse order) it does no good at all. In the best case (input already sorted) it lets you "sort" the data in a single run through the input. In an average case (input in random order) it lets you approximately double the length of each sorted run, which will typically improve speed by around 20-25% (though the percentage varies depending on how much larger your data is than the available memory).
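That run-building trick is usually called replacement selection. Here's a rough sketch using Python's heapq, with capacity standing in for the number of records that fit in memory, and runs yielded as lists rather than written to files (a simplification for illustration):

    import heapq

    def replacement_selection(records, capacity):
        it = iter(records)
        current = [x for _, x in zip(range(capacity), it)]  # fill memory
        heapq.heapify(current)
        pending = []   # items too small for the current run: the second heap
        run = []
        while current:
            smallest = heapq.heappop(current)
            run.append(smallest)                  # "write" to the run file
            nxt = next(it, None)
            if nxt is not None:
                if nxt >= smallest:
                    heapq.heappush(current, nxt)  # still fits in this run
                else:
                    pending.append(nxt)           # save it for the next run
            if not current:                       # first heap empty: run done
                yield run
                run = []
                current = pending                 # second heap takes over
                heapq.heapify(current)
                pending = []

On random input each run comes out around twice capacity; on already-sorted input nothing ever lands in pending, so the whole input becomes a single run.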
Quicksort depends on being able to index into an array or similar structure. When that's possible, it's hard to beat Quicksort.
But you can't index directly into a linked list very quickly. That is, if myList is a linked list, then myList[x], were it possible to write such syntax, would involve starting at the head of the list and following the first x links. That would have to be done twice for every comparison that Quicksort makes, and that would get expensive real quick.
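To see why, here's roughly what that hypothetical myList[x] would have to do under the hood (assuming a Node with a next link, as in the merge-sort sketch above):

    def index(head, x):
        node = head
        for _ in range(x):       # one link followed per step: O(x) just to
            node = node.next     # reach element x, before any comparison
        return node.value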
Same thing on disk: Quicksort would have to seek and read every item it wants to compare.
Merge sort is faster in these situations because it reads the items sequentially, typically making log2(N) passes over the data. There is much less I/O involved, and much less time spent following links in a linked list.
Quicksort is fast when the data fits into memory and can be addressed directly. Mergesort is faster when data won't fit into memory or when it's expensive to get to an item.
Note that large file sorts typically load as much of the file as will fit into memory, Quicksort that, and write it out to a temporary file, repeating until the entire file has been processed. At that point there is some number of blocks, each one of which is sorted, and the program then does an N-way merge to produce the sorted output.
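A compressed sketch of that scheme, with plain in-memory lists standing in for the temporary files and heapq.merge doing the N-way merge (chunk_size stands in for available memory; a real external sort would write each run to disk and stream it back):

    import heapq

    def external_sort(records, chunk_size):
        runs = []
        chunk = []
        for item in records:
            chunk.append(item)
            if len(chunk) == chunk_size:
                chunk.sort()          # stands in for the Quicksort pass
                runs.append(chunk)    # real version: write to a temp file
                chunk = []
        if chunk:
            chunk.sort()
            runs.append(chunk)
        return heapq.merge(*runs)     # N-way merge over the sorted runs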
A quicksort will move records into the middle of the list. On a linked list, moving an item to index X means starting at 0 and following the links one record at a time.
A mergesort splits the list into several small lists and only ever compares the items at the head of those lists.
The setup for a merge sort is typically more expensive than the iteration required by a quicksort. However, when a list is sufficiently large, or the reads are expensive (like from a disk), the time it takes for the quicksort to iterate becomes a major factor.