Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.
I first tried sorting & slicing, which works. But
Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:
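def topNs(xs: TraversableOnce[Int], n: Int) = {
  var ss = List[Int]()     // current top elements, kept sorted ascending
  var min = Int.MaxValue   // smallest element currently in ss
  var len = 0              // length of ss, tracked to avoid calling ss.length
  xs foreach { e =>
    // Only touch the list while it is still filling up or when e beats the current minimum
    if (len < n || e > min) {
      ss = (e :: ss).sorted
      min = ss.head
      len += 1
    }
    // Drop the smallest element if the insertion pushed the list past n
    if (len > n) {
      ss = ss.tail
      min = ss.head
      len -= 1
    }
  }
  ss
}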
(Edited because I originally used a SortedSet, not realising you wanted to keep duplicates.)
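If re-sorting the kept list on every insertion ever becomes the bottleneck (say, for a much larger n), one possible variation is to hold the current top n in a min-heap instead. This is only a sketch of that alternative, not the code I benchmarked: the name topNsHeap is made up for illustration, and I've assumed an Iterable[Int] input so it compiles on newer Scala versions as well.

```scala
import scala.collection.mutable

// Hypothetical variant of topNs: keeps the current top n in a min-heap,
// so replacing the smallest kept element costs O(log n) rather than a full re-sort.
def topNsHeap(xs: Iterable[Int], n: Int): List[Int] = {
  // Reversing the ordering makes the smallest kept element the head of the queue
  val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  xs.foreach { e =>
    if (heap.size < n) {
      heap.enqueue(e)                 // still filling up to n elements
    } else if (n > 0 && e > heap.head) {
      heap.dequeue()                  // evict the current minimum
      heap.enqueue(e)
    }
  }
  heap.toList.sorted                  // ascending, matching what topNs returns
}

// Duplicates are kept:
// topNsHeap(List(5, 1, 9, 3, 9, 7), 3) == List(7, 9, 9)
```

As with the list version, elements equal to the current minimum are skipped once the heap is full, which does not change the result.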
I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and your elite2 method takes about 4100 ms. So this is over 20x quicker than your fastest.
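For anyone wanting to reproduce this kind of comparison, here is a rough sketch of a timing helper. It is not the exact harness I used: meanMillis is a hypothetical name, the run counts are arbitrary, and it times the plain sort-and-slice baseline as a stand-in for whichever method you want to measure. Numbers will vary by machine and JVM, so treat them as relative only.

```scala
import scala.util.Random

// Hypothetical helper: evaluates `body` `runs` times and returns the
// mean wall-clock time per run in milliseconds.
def meanMillis[A](runs: Int)(body: => A): Double = {
  val start = System.nanoTime()
  var i = 0
  while (i < runs) { body; i += 1 }
  (System.nanoTime() - start) / 1e6 / runs
}

val data = List.fill(100000)(Random.nextInt())

// A few untimed warm-up runs so the JIT has compiled the hot path before measuring
meanMillis(5)(data.sorted.takeRight(20))

val sortSliceMs = meanMillis(20)(data.sorted.takeRight(20))
println(f"sort and slice: $sortSliceMs%.1f ms")
```

Swapping the expression passed to meanMillis for a call to topNs(data, 20) gives the corresponding figure for the traversal approach.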