Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.
I first tried sorting & slicing, which works. But
Don't overestimate how big log(M) is for a large list of length M. For a list containing a billion items, log2(M) is only about 30. So sorting and taking is not such an unreasonable method after all. In fact, sorting an array of integers is far faster than sorting a list (and the array takes less memory also), so I would say that your best (brief) bet, which is safe for short or empty lists thanks to takeRight, is:
val arr = s.toArray
java.util.Arrays.sort(arr)
arr.takeRight(N).toList
There are various other approaches one could take, but the implementations are less straightforward. You could use a partial quicksort, but you have the same problems with worst-case scenarios that quicksort does (e.g. if your list is already sorted, a naive algorithm might be O(n^2)!). You could save the top N in a ring buffer (array), but that would require an O(log N) binary search every step as well as O(N/4) sliding of elements, so it's only good if N is quite small. More complex methods (like something based upon dual pivot quicksort) are, well, more complex.
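As a rough illustration of the "keep the top N in an array" idea, here is a minimal sketch. It uses a plain sorted array rather than a true ring buffer, which is a simplification on my part, and the name topNSortedBuffer is hypothetical. Each incoming element costs one binary search plus a block shift, which is why this only pays off for small N.

```scala
import java.util.Arrays

// Sketch only: keeps the n largest values seen so far in ascending order.
def topNSortedBuffer(xs: Iterator[Int], n: Int): Array[Int] = {
  val buf = new Array[Int](math.max(n, 0))
  var len = 0

  // Index at which e can be inserted while keeping buf(0 until len) sorted.
  def insertionIndex(e: Int): Int = {
    val r = Arrays.binarySearch(buf, 0, len, e)
    if (r >= 0) r else -r - 1
  }

  xs.foreach { e =>
    if (len < buf.length) {
      // Buffer not full yet: slide the tail right and insert.
      val pos = insertionIndex(e)
      System.arraycopy(buf, pos, buf, pos + 1, len - pos)
      buf(pos) = e
      len += 1
    } else if (len > 0 && e > buf(0)) {
      // Buffer full: drop the current minimum by sliding the smaller
      // elements left, then place e just before the first larger one.
      val pos = insertionIndex(e)
      System.arraycopy(buf, 1, buf, 0, pos - 1)
      buf(pos - 1) = e
    }
  }
  Arrays.copyOf(buf, len)
}
```

For example, topNSortedBuffer(Iterator(5, 1, 9, 3, 9, 7, 2), 3) keeps 7, 9 and 9, duplicates included.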
So I recommend that you try array sorting and see if that's fast enough.
(Answers differ if you're sorting objects instead of numbers, of course, but if your comparison can always be reduced to a number, you can s.map(x => /* convert element to corresponding number */).toArray and then take the winning scores and run through the list again, counting off the number that you need to take of each score as you find them; it's a bit of bookkeeping, but doesn't slow things down much except for the map.)
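The bookkeeping for the object case might look something like the sketch below. The function name topNBy and the score parameter are made up for illustration; the assumption is that every element can be reduced to an Int score. The result comes back in original list order, not sorted by score.

```scala
import scala.collection.mutable

// Sketch only: score each element, sort the scores, then make a second pass
// over the list, taking as many elements of each winning score as needed.
def topNBy[A](xs: List[A], n: Int)(score: A => Int): List[A] = {
  val scores = xs.map(score).toArray
  java.util.Arrays.sort(scores)
  val winners = scores.takeRight(n)

  // How many elements of each winning score we still have to take.
  val remaining = mutable.Map.empty[Int, Int].withDefaultValue(0)
  winners.foreach(s => remaining(s) += 1)

  xs.filter { x =>
    val s = score(x)
    if (remaining(s) > 0) { remaining(s) -= 1; true } else false
  }
}
```

With topNBy(List("a", "bbb", "cc", "ddd", "e"), 3)(_.length) the winning scores are 2, 3 and 3, so the second pass keeps "bbb", "cc" and "ddd".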
The classic algorithm is called QuickSelect. It is like QuickSort, except you only descend into half of the tree, so it ends up being O(n) on average.
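For reference, a minimal in-place sketch of QuickSelect on an Array[Int] could look like this. The name topNQuickSelect, the Lomuto partition scheme and the fixed random seed are my own choices, not something from the question. The result is not sorted, and this simple partition can degrade towards O(n^2) when the input contains very many equal values.

```scala
import scala.util.Random

// Sketch only: returns the n largest values of input, in no particular order.
def topNQuickSelect(input: Array[Int], n: Int): Array[Int] = {
  val a = input.clone()                      // leave the caller's array alone
  val k = math.max(0, math.min(n, a.length))
  if (k == 0) return Array.empty[Int]

  def swap(i: Int, j: Int): Unit = { val t = a(i); a(i) = a(j); a(j) = t }

  val target = a.length - k                  // the top k end up in a(target until a.length)
  val rnd = new Random(42)
  var lo = 0
  var hi = a.length - 1
  while (lo < hi) {
    // Lomuto partition around a randomly chosen pivot.
    swap(lo + rnd.nextInt(hi - lo + 1), hi)
    val pivot = a(hi)
    var store = lo
    var i = lo
    while (i < hi) {
      if (a(i) < pivot) { swap(i, store); store += 1 }
      i += 1
    }
    swap(store, hi)                          // pivot now sits at its sorted position
    // Descend into one side only; this is what makes it O(n) on average.
    if (store == target) lo = hi
    else if (store < target) lo = store + 1
    else hi = store - 1
  }
  a.slice(target, a.length)
}
```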
Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:
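def topNs(xs: TraversableOnce[Int], n: Int) = {
  var ss = List[Int]()       // current top n, kept sorted ascending
  var min = Int.MaxValue     // smallest element of ss
  var len = 0
  xs foreach { e =>
    // only touch ss while it is still filling up or when e beats the minimum
    if (len < n || e > min) {
      ss = (e :: ss).sorted
      min = ss.head
      len += 1
    }
    // one element too many: drop the smallest
    if (len > n) {
      ss = ss.tail
      min = ss.head
      len -= 1
    }
  }
  ss
}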
(edited because I originally used a SortedSet, not realising you wanted to keep duplicates.)
I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and your elite2 method takes about 4100 ms. So this is over 20 times quicker than your fastest.
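If re-sorting the kept list on every insertion ever becomes a concern for larger N, one alternative is a bounded min-heap. This swaps in a different data structure from the list used above; topNHeap is a hypothetical name, and I have not benchmarked it against the figures quoted here.

```scala
import scala.collection.mutable

// Sketch only: same single-pass idea, with a min-heap holding the current top n.
def topNHeap(xs: Iterator[Int], n: Int): List[Int] = {
  // Reversing the ordering turns Scala's max-priority queue into a min-heap,
  // so heap.head is the smallest of the kept elements.
  val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  xs.foreach { e =>
    if (heap.size < n) heap.enqueue(e)
    else if (n > 0 && e > heap.head) {
      heap.dequeue()
      heap.enqueue(e)
    }
  }
  // Drain smallest-first, giving an ascending list like topNs returns.
  val out = List.newBuilder[Int]
  while (heap.nonEmpty) out += heap.dequeue()
  out.result()
}
```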
I wanted a version that was polymorphic and that could also be composed over a single iterator. For instance, what if you wanted both the largest and the smallest elements when reading from a file? Here is what I came up with:
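import scala.reflect.ClassTag
import scala.util.Sorting.quickSort

// ClassTag replaces the deprecated ClassManifest; it is needed to create Array[T].
// Assumes n >= 1.
class TopNSet[T](n: Int)(implicit ev: Ordering[T], ev2: ClassTag[T]) {
  val ss = new Array[T](n)
  var len = 0

  def tryElement(el: T) = {
    if (len < n - 1) {
      // still filling up: just append
      ss(len) = el
      len += 1
    }
    else if (len == n - 1) {
      // array is now full: sort once so that ss(0) is the minimum
      ss(len) = el
      len = n
      quickSort(ss)
    }
    else if (ev.gt(el, ss(0))) {
      // el beats the current minimum: replace it and re-sort
      ss(0) = el
      quickSort(ss)
    }
  }

  def getTop() = {
    ss.slice(0, len)
  }
}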
Timing it against the accepted answer:
val myInts = Array.fill(100000000)(util.Random.nextInt)
time(topNs(myInts, 100))
//Elapsed time 3006.05485 msecs
val myTopSet = new TopNSet[Int](100)
time(myInts.foreach(myTopSet.tryElement(_)))
//Elapsed time 4334.888546 msecs
So, not much slower, and certainly a lot more flexible.
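The time helper used in that comparison isn't shown in the answer. A minimal stand-in, purely an assumption about what it might have looked like, would be:

```scala
// Hypothetical helper: runs the block once, prints the wall-clock time in
// milliseconds, and passes the block's result through.
def time[A](block: => A): A = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(s"Elapsed time $elapsedMs msecs")
  result
}
```

A single run like this includes JIT warm-up, so the numbers are only a rough guide.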
Here's pseudocode for the algorithm I'd use:
selectLargest(n: Int, xs: List): List
  if size(xs) <= n
    return xs
  pivot <- selectPivot(xs)
  (lt, gt) <- partition(xs, pivot)
  if size(gt) == n
    return gt
  if size(gt) < n
    return append(gt, selectLargest(n - size(gt), lt))
  if size(gt) > n
    return selectLargest(n, gt)
selectPivot would use some technique to select a "pivot" value for partitioning the list. partition would split the list into two: lt (elements smaller than the pivot) and gt (elements greater than the pivot). Of course, you'd need to throw elements equal to the pivot in one of those groups, or else handle that group separately. It doesn't make a big difference, as long as you remember to handle that case somehow.
Feel free to edit this answer, or post your own answer, with a Scala implementation of this algorithm.
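One possible Scala rendering of the pseudocode, as a sketch rather than a tuned implementation: it is specialised to List[Int], picks the middle element as the pivot (an arbitrary choice for selectPivot), and handles elements equal to the pivot as a separate third group so the recursion always shrinks.

```scala
// Sketch only: returns the n largest elements of xs, in no particular order.
def selectLargest(n: Int, xs: List[Int]): List[Int] =
  if (n <= 0) Nil
  else if (xs.size <= n) xs
  else {
    val pivot = xs(xs.size / 2)        // selectPivot: middle element
    val gt = xs.filter(_ > pivot)
    val eq = xs.filter(_ == pivot)     // equal elements handled separately
    val lt = xs.filter(_ < pivot)
    if (gt.size >= n) selectLargest(n, gt)
    else if (gt.size + eq.size >= n) gt ++ eq.take(n - gt.size)
    else gt ++ eq ++ selectLargest(n - gt.size - eq.size, lt)
  }
```

Because the pivot is never part of gt or lt, each recursive call works on a strictly smaller list, even when the input is full of duplicates.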