Top n items in a List (including duplicates)

谎友^

Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.

I first tried sorting & slicing, which works. But

5 Answers

余生分开走

2021-01-20 22:24
Don't overestimate how big log(M) is for a large list of length M. For a list containing a billion items, log2(M) is only about 30, so sorting and taking is not such an unreasonable method after all. In fact, sorting an array of integers is far faster than sorting a list (and the array takes less memory as well), so I would say that your best (brief) bet, which is safe for short or empty lists thanks to takeRight, is:
```
val arr = s.toArray
java.util.Arrays.sort(arr)
arr.takeRight(N).toList
```
There are various other approaches one could take, but the implementations are less straightforward. You could use a partial quicksort, but you have the same problems with worst-case scenarios that quicksort does (e.g. if your list is already sorted, a naive algorithm might be O(n^2)!). You could save the top N in a ring buffer (array), but that would require O(log N) binary search every step as well as O(N/4) sliding of elements--only good if N is quite small. More complex methods (like something based upon dual pivot quicksort) are, well, more complex.

So I recommend that you try array sorting and see if that's fast enough.

(Answers differ if you're sorting objects instead of numbers, of course, but if your comparison can always be reduced to a number, you can s.map(x => /* convert element to corresponding number*/).toArray and then take the winning scores and run through the list again, counting off the number that you need to take of each score as you find them; it's a bit of bookkeeping, but doesn't slow things down much except for the map.)
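
The bookkeeping described in the parenthetical above might look roughly like the sketch below. It is not the answerer's code: `topNBy` and `score` are hypothetical names, and it assumes every element can be reduced to an Int score. Only the lowest winning score needs counting off, because every element scoring strictly above it is a winner.

```scala
import java.util.Arrays

// Hypothetical helper: the n highest-scoring elements, duplicates included,
// returned in their original order.
def topNBy[A](xs: Seq[A], n: Int)(score: A => Int): List[A] = {
  if (n <= 0 || xs.isEmpty) Nil
  else {
    // Sort only the primitive scores, which is cheaper than sorting objects.
    val scores = xs.iterator.map(score).toArray
    Arrays.sort(scores)
    val winners = scores.takeRight(n)
    val cutoff = winners.head // lowest winning score
    // Ties at the cutoff: take only as many as the winners actually contain.
    var atCutoff = winners.count(_ == cutoff)
    val out = List.newBuilder[A]
    xs.foreach { x =>
      val s = score(x)
      if (s > cutoff) out += x
      else if (s == cutoff && atCutoff > 0) { out += x; atCutoff -= 1 }
    }
    out.result()
  }
}
```

For example, `topNBy(words, 3)(_.length)` would return the three longest strings of a hypothetical `words` sequence.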

难免孤独

2021-01-20 22:26

The classic algorithm is called QuickSelect. It is like QuickSort, except you only descend into half of the tree, so it ends up being O(n) on average.
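
A minimal sketch of the idea for an Array[Int], under some assumptions of mine: `topNQuickSelect` is a hypothetical name, the pivot is chosen at random, and a three-way partition is used so that runs of duplicates cannot stall the loop. Only the side of the partition that contains the cut point is processed further.

```scala
import scala.util.Random

// Hypothetical helper: the n largest values, in no particular order.
// Works on a copy so the caller's array is left untouched.
def topNQuickSelect(xs: Array[Int], n: Int): Array[Int] = {
  def swap(a: Array[Int], i: Int, j: Int): Unit = {
    val t = a(i); a(i) = a(j); a(j) = t
  }
  if (n <= 0) Array.empty[Int]
  else if (n >= xs.length) xs.clone()
  else {
    val a = xs.clone()
    val target = a.length - n // the top n end up in a(target until a.length)
    var lo = 0
    var hi = a.length - 1
    var done = false
    while (!done) {
      // Three-way partition of a(lo to hi) around a random pivot:
      // a(lo until lt) < pivot, a(lt to gt) == pivot, a(gt + 1 to hi) > pivot
      val pivot = a(lo + Random.nextInt(hi - lo + 1))
      var lt = lo
      var i = lo
      var gt = hi
      while (i <= gt) {
        if (a(i) < pivot) { swap(a, lt, i); lt += 1; i += 1 }
        else if (a(i) > pivot) { swap(a, i, gt); gt -= 1 }
        else i += 1
      }
      if (target < lt) hi = lt - 1      // cut point lies among the smaller values
      else if (target > gt) lo = gt + 1 // cut point lies among the larger values
      else done = true                  // cut point falls inside the pivot-equal run
    }
    a.slice(target, a.length)
  }
}
```

The result is unsorted; sorting just those n values afterwards costs O(n log n) in n rather than in the length of the input.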

时光取名叫无心

2021-01-20 22:28
Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:
```
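  // Returns the n largest elements of xs in ascending order, keeping duplicates.
  def topNs(xs: TraversableOnce[Int], n: Int): List[Int] = {
    var ss = List[Int]()     // current candidates, kept sorted ascending
    var min = Int.MaxValue   // smallest candidate, i.e. ss.head once ss is non-empty
    var len = 0
    xs foreach { e =>
      if (len < n || e > min) {
        ss = (e :: ss).sorted
        min = ss.head
        len += 1
      }
      if (len > n) {
        // One candidate too many: drop the smallest.
        ss = ss.tail
        min = ss.head
        len -= 1
      }
    }
    ss
  }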
```
(edited because I originally used a SortedSet not realising you wanted to keep duplicates.)

I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and your elite2 method takes about 4100 ms. So this is over 20 times quicker than your fastest.
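
When n is larger, re-sorting the candidate list on every insertion becomes the dominant cost. One variation on the traversal idea above, not the answerer's code, keeps the candidates in a min-heap instead; `topNsHeap` is a hypothetical name, and the sketch assumes a single pass over an Iterator[Int].

```scala
import scala.collection.mutable

// Hypothetical variant: the n largest elements in ascending order, duplicates kept.
def topNsHeap(xs: Iterator[Int], n: Int): List[Int] = {
  // PriorityQueue is a max-heap; reversing the ordering makes `head`
  // the smallest of the current candidates.
  val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  xs.foreach { e =>
    if (heap.size < n) heap.enqueue(e)
    else if (n > 0 && e > heap.head) {
      heap.dequeue() // evict the smallest candidate
      heap.enqueue(e)
    }
  }
  heap.toList.sorted
}
```

Each replacement then costs O(log n) rather than a full sort of the candidates.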

暖寄归人

2021-01-20 22:36

I wanted a version that was polymorphic and that could be composed with other consumers in a single pass over one iterator. For instance, what if you wanted both the largest and the smallest elements when reading from a file? Here is what I came up with:

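    import scala.reflect.ClassTag
    import scala.util.Sorting.quickSort

    // Keeps the n largest elements seen so far in an array (smallest at index 0 once full).
    class TopNSet[T](n: Int)(implicit ev: Ordering[T], ev2: ClassTag[T]) {
      val ss = new Array[T](n)
      var len = 0

      def tryElement(el: T): Unit = {
        if (len < n - 1) {
          // Still filling up: append without sorting.
          ss(len) = el
          len += 1
        }
        else if (len == n - 1) {
          // Array just became full: sort once so ss(0) is the minimum.
          ss(len) = el
          len = n
          quickSort(ss)
        }
        else if (n > 0 && ev.gt(el, ss(0))) {
          // Replace the current minimum and restore sorted order.
          ss(0) = el
          quickSort(ss)
        }
      }

      // Sorted only once n elements have been seen.
      def getTop(): Array[T] = ss.slice(0, len)
    }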

Timings compared with the accepted answer:

    val myInts = Array.fill(100000000)(util.Random.nextInt)
    time(topNs(myInts, 100))
    // Elapsed time 3006.05485 msecs
    val myTopSet = new TopNSet[Int](100)
    time(myInts.foreach(myTopSet.tryElement(_)))
    // Elapsed time 4334.888546 msecs

So it is not much slower, and certainly a lot more flexible.


一生所求

2021-01-20 22:44
Here's pseudocode for the algorithm I'd use:
```
selectLargest(n: Int, xs: List): List
  if size(xs) <= n
     return xs
  pivot <- selectPivot(xs)
  (lt, gt) <- partition(xs, pivot)
  if size(gt) == n
     return gt
  if size(gt) < n
     return append(gt, selectLargest(n - size(gt), lt))
  if size(gt) > n
     return selectLargest(n, gt)
```
selectPivot would use some technique to select a "pivot" value for partitioning the list. partition would split the list into two: lt (elements smaller than the pivot) and gt (elements greater than the pivot). Of course, you'd need to throw elements equal to the pivot in one of those groups, or else handle that group separately. It doesn't make a big difference, as long as you remember to handle that case somehow.

Feel free to edit this answer, or post your own answer, with a Scala implementation of this algorithm.
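
One possible Scala rendering of the pseudocode above, offered as a sketch rather than a tuned implementation. Assumptions of mine: the list holds Ints, the pivot is simply the head of the list (a poor choice for already-sorted input), and elements equal to the pivot go into a third group `eq` that is handled separately, which guarantees each recursive call sees a strictly shorter list.

```scala
// Sketch of selectLargest: the n largest elements of xs, in no particular order.
def selectLargest(n: Int, xs: List[Int]): List[Int] = {
  if (n <= 0) Nil
  else if (xs.lengthCompare(n) <= 0) xs // n or fewer elements: keep them all
  else {
    val pivot = xs.head // assumption: head as pivot, for brevity
    val gt = xs.filter(_ > pivot)
    val eq = xs.filter(_ == pivot)
    val lt = xs.filter(_ < pivot)
    val gtSize = gt.size
    if (gtSize >= n) selectLargest(n, gt)                       // answer lies within gt
    else if (gtSize + eq.size >= n) gt ::: eq.take(n - gtSize)  // top up with pivot copies
    else gt ::: eq ::: selectLargest(n - gtSize - eq.size, lt)  // still need some of lt
  }
}
```

The recursion is not stack-safe for very long, badly partitioned inputs; a random or median-of-three pivot would make deep recursion much less likely.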