Question
There is a decent amount of literature on merging K sorted files. The approaches all rest on the same idea: put the first element of each file in a heap, then, until the heap is empty, poll the smallest element and replace it with the next element from the file it came from. This works as long as one record from each file fits in the heap.
Now suppose I have N sorted files but can only hold K records in the heap, where K < N. Say N = Kc, where "c" is a multiplier, meaning that N is some multiple of K. Clearly this requires doing a K-way merge over and over until only K files are left, and then merging those one last time into the final sorted output. How do I implement this, and what is its complexity?
Answer 1:
There are multiple examples of k-way merge written in Java. One is http://www.sanfoundry.com/java-program-k-way-merge-algorithm/.
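For readers who prefer a compact illustration over the linked Java program, here is a minimal Python sketch of the heap-based k-way merge described in the question. The function name `k_way_merge` and the use of in-memory iterables in place of files are assumptions made for illustration; Python's standard library already provides `heapq.merge`, which does the same job.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def k_way_merge(sources: List[Iterable[int]]) -> Iterator[int]:
    """Lazily merge already-sorted iterables into one sorted stream."""
    iterators = [iter(s) for s in sources]
    heap: List[Tuple[int, int]] = []  # (value, index of the source it came from)

    # Seed the heap with the first record of every non-empty source.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)

    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        # Refill from the same source the popped record came from.
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
```

The heap never holds more than one record per source, which is exactly the memory limit the question is concerned with.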
To implement your merge, you just have to write a simple wrapper that repeatedly scans your directory and feeds the merge up to k files at a time, until there's only one file left. The merged input files have to be removed after each merge, otherwise the file count never shrinks. The basic idea is:
while number of files > 1
    fileList = load all file names
    i = 0
    while i < fileList.length
        filesToMerge = copy files i through i+k-1 (or through the end) from fileList
        merge(filesToMerge, output file name)
        delete filesToMerge
        i += k
    end while
end while
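The wrapper above might look roughly like this in Python. This is a sketch under several assumptions that are not in the original answer: each file holds one integer per line with no blank lines, the intermediate file names (`pass0_0.txt` and so on) are made up and do not clash with the input names, and a lone leftover file is carried forward instead of being copied. The names `merge_files` and `merge_all` are hypothetical.

```python
import heapq
import os
from typing import List


def merge_files(inputs: List[str], output: str) -> None:
    """Merge sorted text files (one integer per line) into `output`."""
    handles = [open(path) for path in inputs]
    try:
        with open(output, "w") as out:
            # One lazy stream per file, so only one record per file is in memory.
            streams = [(int(line) for line in h) for h in handles]
            for value in heapq.merge(*streams):
                out.write(f"{value}\n")
    finally:
        for h in handles:
            h.close()


def merge_all(files: List[str], k: int, workdir: str) -> str:
    """Merge at most k files at a time, pass after pass, until one remains."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if not files:
        raise ValueError("no input files")
    pass_no = 0
    while len(files) > 1:
        next_files = []
        for i in range(0, len(files), k):
            group = files[i:i + k]
            if len(group) == 1:
                # A lone leftover file is already sorted; carry it forward.
                next_files.append(group[0])
                continue
            out = os.path.join(workdir, f"pass{pass_no}_{i // k}.txt")
            merge_files(group, out)
            for path in group:
                os.remove(path)  # inputs are consumed by the merge
            next_files.append(out)
        files = next_files
        pass_no += 1
    return files[0]
```

Tracking the file list in memory avoids rescanning the directory on every pass, but the directory-scanning version in the pseudocode works the same way.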
Complexity analysis
This is easier to think about if we assume that each file contains the same number of items.
You have to merge M files, each of which contains n items, but you can only merge k files at a time. So you have to do log_k(M) passes, rounded up. That is, if you have 1,024 files and you can only merge 16 at a time, then you'll make one pass that merges 16 files at a time, creating a total of 64 files. Then you'll make another pass that merges 16 files at a time, creating four files, and your final pass will merge those four files to create the output.
If you have k files, each of which contains n items, then the complexity of merging them is O(n*k log2 k).
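The pass count in this example can be checked with a few lines of Python. The helper name `files_after_each_pass` is made up for illustration; it simply applies ceiling division by k until one file remains.

```python
from typing import List


def files_after_each_pass(num_files: int, k: int) -> List[int]:
    """Return how many files remain after each k-way merge pass."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counts = []
    while num_files > 1:
        num_files = -(-num_files // k)  # ceiling division: a partial group still yields a file
        counts.append(num_files)
    return counts


print(files_after_each_pass(1024, 16))  # [64, 4, 1]
```

The length of the returned list is the number of passes, which is three for 1,024 files merged 16 at a time.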
So in the first pass you do M/k merges, each of which has complexity O(nk log k). That's O((M/k) * n * k * log2 k), or O(Mn log k).
Now, each of your files contains n*k items, and you do M/k^2 merges of k files each, so each merge handles n*k^2 items. So the second pass complexity is O((M/k^2) * n * k^2 * log2 k). Simplified, that too works out to O(Mn log k).
Note that in every pass you're working with all M*n items, so each pass you do is O(Mn log k). And you're doing log_k(M) passes. So the total complexity is O(log_k(M) * (Mn log k)). Because log_k(M) * log k = log M, that simplifies to
O(Mn log M)
The assumption that every file contains the same number of items doesn't affect the asymptotic analysis because, as I've shown, every pass manipulates the same number of items: M*n.
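To see that point with unequal file sizes, here is a small Python simulation. It assumes a model in which every file, including a lone leftover, is read and rewritten once per pass; the function name `items_per_pass` and the sample sizes are invented for illustration.

```python
from typing import List


def items_per_pass(sizes: List[int], k: int) -> List[int]:
    """Return the number of items read and written in each merge pass."""
    if k < 2:
        raise ValueError("k must be at least 2")
    totals = []
    while len(sizes) > 1:
        totals.append(sum(sizes))  # every item passes through exactly one merge
        # Each group of up to k files becomes one file holding all their items.
        sizes = [sum(sizes[i:i + k]) for i in range(0, len(sizes), k)]
    return totals


print(items_per_pass([5, 1, 7, 2, 9, 3], 2))  # [27, 27, 27]
```

Merging only moves items between files, so the per-pass total stays fixed at the overall item count regardless of how the sizes are distributed.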
Answer 2:
Here are my thoughts.
I would do it iteratively, with n being the number of files. First, do p = floor(n/k) merges of k files each to get p sorted files. Then repeat the same step on the resulting p + n%k files (the merged files plus the leftovers), until that count is at most k. One final merge then produces the sorted file.
Does it make sense?
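As a rough check of the iteration described in this answer, the following Python sketch traces how the file count shrinks. The function name `file_counts` is hypothetical, and the loop condition (stop once at most k files remain) is one reading of the answer's stopping rule.

```python
from typing import List


def file_counts(n: int, k: int) -> List[int]:
    """Trace the number of files before each round, ending at the final merge."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counts = [n]
    while n > k:
        p = n // k     # full groups of k files, each merged into one file
        n = p + n % k  # merged files plus the leftover, unmerged files
        counts.append(n)
    return counts


print(file_counts(10, 3))  # [10, 4, 2]
```

For 10 files and k = 3, two rounds leave 2 files, which a single final merge turns into the sorted output. The count always shrinks because each full group replaces k files with one.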
Source: https://stackoverflow.com/questions/47935884/merging-n-sorted-files-using-k-way-merge