Fastest way to obtain the largest X numbers from a very large unsorted list?

佛祖请我去吃肉 2021-02-15 04:51

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions) so sorting

3 answers
  • 2021-02-15 05:31

    Here's the 'natural' C++ way to do this:

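    // Requires <algorithm>, <functional>, and <vector>.
    std::vector<Score> v;
    // fill in v
    // Clamp to v.size() so the range never runs past the end of v.
    auto mid = v.begin() + std::min<std::size_t>(100, v.size());
    // Move the largest scores to the front, in descending order.
    std::partial_sort(v.begin(), mid, v.end(), std::greater<Score>());
    std::sort(v.begin(), mid);  // optional: re-sort them ascending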
    

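    For a fixed count such as 100, this is linear in the number of scores: std::partial_sort performs roughly n log k comparisons to pick the top k out of n scores.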

    The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.

  • 2021-02-15 05:43

    You can do this in O(n) time for a fixed result size (O(n log k) in general for the top k of n values), without sorting the full list, using a heap:

    #!/usr/bin/python
    
    import heapq
    
    def top_n(l, n):
        # Candidates for the n largest values; turned into a min-heap once it holds n items.
        heap = []
    
        smallest = None
    
        for elem in l:
            if len(heap) < n:
                heap.append(elem)
                if len(heap) == n:
                    heapq.heapify(heap)
                    smallest = heap[0]  # the root of a min-heap is its smallest item
            else:
                if elem > smallest:
                    # Pop the smallest candidate and push elem in a single step.
                    heapq.heapreplace(heap, elem)
                    smallest = heap[0]
    
        return sorted(heap)
    
    
    def random_ints(n):
        import random
        for i in range(0, n):
            yield random.randint(0, 10000)
    
    print(top_n(random_ints(1000000), 100))
    

    Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):

    • 100000 elements: 0.29 seconds
    • 1000000 elements: 2.8 seconds
    • 10000000 elements: 25.2 seconds

    Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
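
    The C++ translation described above might look roughly like the following. This is only a sketch: the helper name top_k_heap, its signature, and the choice to return the result in ascending order are assumptions made for illustration, not part of the original answer.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper (name and signature are illustrative only).
// Returns the k largest values of `scores` in ascending order.
template <typename T>
std::vector<T> top_k_heap(const std::vector<T>& scores, std::size_t k) {
    if (k == 0) return {};

    // std::greater makes this a min-heap: top() is the smallest of the
    // candidates kept so far, instead of the largest.
    std::priority_queue<T, std::vector<T>, std::greater<T>> heap;

    for (const T& s : scores) {
        if (heap.size() < k) {
            heap.push(s);              // still filling the first k slots
        } else if (s > heap.top()) {
            heap.pop();                // no heapreplace equivalent:
            heap.push(s);              // pop the smallest, then push
        }
    }

    // Drain the min-heap; values come out smallest first.
    std::vector<T> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

    For example, calling top_k_heap on a std::vector<int> holding 5, 1, 9, 3, 7, 2 with k = 3 yields 5, 7, 9. The heap never holds more than k elements, so memory use stays small even when the input has billions of entries.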

  • 2021-02-15 05:51
    1. Take the first 100 scores and sort them in an array.
    2. Take the next score and insertion-sort it into the array (starting at the "small" end).
    3. Drop the 101st value.
    4. Repeat from step 2 with the next value until the input is exhausted.

    Over time, the array will resemble the 100 largest values more and more closely, so more and more often the insertion sort aborts immediately because the new value is smaller than the smallest of the current top-100 candidates.
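
    The steps above could be sketched in C++ as follows. The helper name top_k_insertion and its signature are hypothetical, and as a simplification the first k values are also placed by insertion rather than sorted in a separate step.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (name and signature are illustrative only).
// Returns the k largest values of `scores` in descending order.
template <typename T>
std::vector<T> top_k_insertion(const std::vector<T>& scores, std::size_t k) {
    std::vector<T> best;               // kept sorted, largest first
    if (k == 0) return best;

    for (const T& s : scores) {
        // Once the array is full, most values fail this test and are
        // rejected after a single comparison with the smallest candidate.
        if (best.size() == k && !(s > best.back())) continue;

        // Insert at the "small" end and shift towards the front while
        // the new value is larger than its left neighbour.
        best.push_back(s);
        for (std::size_t i = best.size() - 1; i > 0 && best[i] > best[i - 1]; --i) {
            std::swap(best[i], best[i - 1]);
        }

        // Drop the (k+1)-th value if the array grew past k.
        if (best.size() > k) best.pop_back();
    }
    return best;
}
```

    With the input 5, 1, 9, 3, 7, 2 and k = 3, this returns 9, 7, 5. Each accepted value costs up to k shifts, so this variant suits small k such as 100; for larger k a heap is likely the better choice.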
