I have been asked in my algorithm class to make a K-way merge algorithm which is of O(nlogk)
After searching i found it could be done via making a k length prio
I wrote a series of articles about this some years ago when discussing sorting a large text file. The idea is that the items you put on the heap contain not just the value but also the list the value came from. Or, you could just put the list reference on the heap and have the comparison function compare against the first item in the particular list.
See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=676 and http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=677 for an explanation of the basic algorithm using a sequential list in place of a heap. See http://www.informit.com/guides/content.aspx?g=dotnet&seqNum=680 for an improved version that uses a heap.
As I said in my comment, you can also do the merge without a heap. See https://stackoverflow.com/a/18984961/56778 for a description.
Paul Hankin's solution is right, but it's a little bit difficult to read, especially you wanna implement in c++ or java. My solution is similar to Paul's. If you write in c++ or java, you may need an extra data structure to store the value of the element, index of the element in k-th array and index of the array in lists.
Element{
int value;
int idInArray,
int idInList
}
But in python, I just store in a tuple (value, idInArray, idInList)
def mergeKArray(*lists):
# implemented by min heap
h = []
r = []
for k, arr in enumerate(lists):
heapq.heappush(h, (arr[0], 0, k))
while h:
# min is the minimum element
# i is the index of the min in the k-th array
# k is the index of array in the list
min, i, k = heapq.heappop(h)
r.append(min)
if i < len(lists[k]) - 1:
i += 1
heapq.heappush(h, (lists[k][i], i, k))
return r
Because I just need to maintain a min heap with k elements, the time of eject or inject heap is O(log k). I also need to scan all n elements, and each element cost 2*log(k) time to inject and eject. Therefore, the big-O is O(n*log k).
Instead of simply storing the first element of each list in the priority queue, wrap it in a structure like this;
struct wrapper
{
int list_number;
int element;
}
Then, when you are pushing an element onto the priority queue, just add the list number form where it came. This way, when the minimum element gets popped, you will know from which list you should push the next element to push on the queue by examining popped_element.list_number
.
In order to find if your list is empty you should add a function empty
to it that returns true
if the list does not have any more elements and false otherwise. The function would be very easy to implement. Just check if the size is zero then the list is empty otherwise it has one or more elements.
From your question I assume that a binary heap is used to implement the priority queue. The insertion operation in a binary heap takes O(lg k)
time and the extract-min operation also takes O(lg k)
times where k
is the the size of the heap (number of lists in your case). Now if the total number of elements you have is n
, the total time to process all of them will be O(n lg k)
.
You should not here that "n" the total number of nodes over all the list not just one list. In this case the solution is O(n logk). If we mean that we have n node on an average in every list(total of k lists) then it will be O(nk logk)
Here is an in depth explanation with some code
here is my code using c++ stl
#include<iostream>
#include<vector>
#include<climits>
#include<queue>
#define ROWS 4
#define COLS 8
using namespace std;
struct node
{
int ele;
int arr_no;
int next_index;
};
void printVector(vector<int> v)
{
for(unsigned int i=0;i<v.size();i++)
cout<<v[i]<<" ";
}
// THIS IS THE BASIS ON WHICH THE ELEMENTS OF A PRIORITY QUEUE ARE SORTED AND
// KEPT IN THE QUEUE, HERE THE CRITERIA IS THAT THE NODE WITH SMALLER ELEMENT SHOULD
// COME ABOVE THE ONE WITH LARGER ELEMENT
class compare
{
public:
bool operator()(node& n1, node& n2)
{
if (n1.ele > n2.ele)
return true;
else
return false;
}
};
vector<int> mergeKArrays(vector< vector<int> > v)
{
int k = v.size(); // NUMBER OF LISTS
int n = v.at(0).size(); //SIZE OF EACH LIST
vector<int> result;
//result.resize( n*k );
priority_queue<node, vector<node>, compare> minHeap;
for (int i = 0; i < k; i++)
{
node temp;
temp.ele = v[i][0]; //STORE THE FIRST ELEMENT
temp.arr_no = i; //INDEX OF ARRAY
temp.next_index = 1; //INDEX OF NEXT ELEMENT TO BE STORED FROM ARRAY
minHeap.push(temp);
}
// NOW ONE BY ONE GET THE MINIMUM ELEMENT FROM MIN
// HEAP AND REPLACE IT WITH NEXT ELEMENT OF ITS ARRAY
for (int count = 0; count < n*k; count++)
{
// GET THE MINIMUM ELEMENT AND STORE IT IN OUTPUT
node min_ele_node = minHeap.top();
minHeap.pop();
result.push_back(min_ele_node.ele);
// FIND THE NEXT ELELEMENT THAT WILL REPLACE CURRENT
// ROOT OF HEAP. THE NEXT ELEMENT BELONGS TO SAME
// ARRAY AS THE CURRENT ROOT.
node new_node;
new_node.arr_no = min_ele_node.arr_no;
if (min_ele_node.next_index < n)
{
new_node.ele = v.at(min_ele_node.arr_no)[min_ele_node.next_index];
new_node.next_index = min_ele_node.next_index + 1;
}
// IF ROOT WAS THE LAST ELEMENT OF ITS ARRAY
else
{
new_node.ele = INT_MAX; //INT_MAX IS FOR INFINITE
}
// REPLACE ROOT WITH NEXT ELEMENT OF ARRAY
minHeap.push(new_node);
}
return result;
}
int main()
{
int arr[ROWS][COLS] =
{
{10, 20, 30, 40, 50, 60, 71, 86},
{15, 25, 35, 45, 60, 69, 77, 78},
{27, 29, 37, 48, 50, 65, 75, 78},
{32, 33, 39, 50, 80, 133, 139, 150},
};
vector< vector<int> > matrix ;
for( int i=0 ; i < ROWS; i++)
{
vector<int> vec;
for(int j=0; j < COLS; j++)
vec.push_back(arr[i][j]);
matrix.push_back(vec);
}
vector<int> result = mergeKArrays(matrix);
printVector(result);
return 0;
}
Here's some Python 2 code that does the merging.
import heapq
def addtoheap(h, i, it):
try:
heapq.heappush(h, (next(it), i))
except StopIteration:
pass
def mergek(*lists):
its = map(iter, lists)
h = []
for i, it in enumerate(its):
addtoheap(h, i, it)
while h:
v, i = heapq.heappop(h)
addtoheap(h, i, its[i])
yield v
for x in mergek([1, 3, 5], [2, 4, 6], [7, 8, 9], [10]):
print x
Why is it O(n log k)? Well for each value removed, there's a heap pop and possibly a heap push (both of which are O(log k)). Since we remove n elements, it's O(n log k).