I'm trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.
I first tried sorting & slicing, which works. But
Here's pseudocode for the algorithm I'd use:
    selectLargest(n: Int, xs: List): List
      if size(xs) <= n
        return xs
      pivot <- selectPivot(xs)
      (lt, gt) <- partition(xs, pivot)
      if size(gt) == n
        return gt
      if size(gt) < n
        return append(gt, selectLargest(n - size(gt), lt))
      if size(gt) > n
        return selectLargest(n, gt)
selectPivot would use some technique to select a "pivot" value for partitioning the list. partition would split the list into two: lt (elements smaller than the pivot) and gt (elements greater than the pivot). Of course, you'd need to put elements equal to the pivot in one of those groups, or else handle them as a separate group. Either choice works, as long as you remember to handle that case and make sure every recursive call receives a strictly smaller list; otherwise a list whose elements are all equal to the pivot could recurse forever.
Feel free to edit this answer, or post your own answer, with a Scala implementation of this algorithm.
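
As a starting point, here is a minimal Scala sketch of the pseudocode above. It is not a tuned implementation, and several choices in it are my own assumptions rather than part of the algorithm as stated: the enclosing object name TopN, the use of Vector instead of List, a random pivot standing in for selectPivot, and handling elements equal to the pivot as a separate third group.

```scala
import scala.util.Random

object TopN {
  /** Returns the n largest elements of xs, duplicates included, in no particular order. */
  def selectLargest[A](n: Int, xs: Vector[A])(implicit ord: Ordering[A]): Vector[A] = {
    if (n <= 0) Vector.empty
    else if (xs.size <= n) xs
    else {
      // Assumption: a random pivot plays the role of selectPivot.
      val pivot = xs(Random.nextInt(xs.size))
      // gt never contains the pivot itself, so it is strictly smaller than xs.
      val gt = xs.filter(ord.gt(_, pivot))
      if (gt.size >= n) selectLargest(n, gt)
      else {
        // Elements equal to the pivot form their own group (an assumption of this sketch).
        val eq = xs.filter(ord.equiv(_, pivot))
        val need = n - gt.size
        if (eq.size >= need) gt ++ eq.take(need)
        else {
          // Still short: take everything so far and recurse into the smaller elements.
          val lt = xs.filter(ord.lt(_, pivot))
          gt ++ eq ++ selectLargest(need - eq.size, lt)
        }
      }
    }
  }
}
```

For example, TopN.selectLargest(3, Vector(5, 1, 9, 3, 9, 7)) should return 9, 9 and 7 in some order. With a random pivot the expected running time is linear in the size of the list, but the three filter passes per level are written for clarity; a single-pass partition would do less work.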