Fastest way to obtain the largest X numbers from a very large unsorted list?

佛祖请我去吃肉 2021-02-15 04:51

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions) so sorting

3 answers
  •  粉色の甜心
    2021-02-15 05:43

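    You can do this without sorting the whole list by keeping a min-heap of the n largest elements seen so far. For a list of N elements this takes O(N log n) time, which is effectively O(N) when n is small and fixed (100 here):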

    #!/usr/bin/python
    
    import heapq
    import random
    
    def top_n(l, n):
        """Return the n largest elements of iterable l, in ascending order."""
        if n <= 0:
            return []
    
        heap = []  # becomes a min-heap once it holds n elements
    
        for elem in l:
            if len(heap) < n:
                heap.append(elem)
                if len(heap) == n:
                    heapq.heapify(heap)
            elif elem > heap[0]:
                # heap[0] is the smallest element kept so far; replace it
                heapq.heapreplace(heap, elem)
    
        return sorted(heap)
    
    
    def random_ints(n):
        """Yield n random integers between 0 and 10000 inclusive."""
        for _ in range(n):
            yield random.randint(0, 10000)
    
    print(top_n(random_ints(1000000), 100))
    

    Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):

    • 100000 elements: 0.29 seconds
    • 1000000 elements: 2.8 seconds
    • 10000000 elements: 25.2 seconds
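
    For comparison, Python's standard library already wraps this pattern: heapq.nlargest(n, iterable) returns the n largest items, largest first. The following is a minimal sketch, not part of the timings above; the helper name top_n_builtin and the sample data are illustrative assumptions.

```python
import heapq
import random


def top_n_builtin(scores, n):
    """Return the n largest scores in ascending order."""
    # heapq.nlargest accepts any iterable, including a generator, and
    # returns the results largest-first, so reverse them to get
    # ascending order.
    return heapq.nlargest(n, scores)[::-1]


if __name__ == "__main__":
    # Illustrative data only: random scores produced lazily by a generator.
    scores = (random.randint(0, 10000) for _ in range(100000))
    print(top_n_builtin(scores, 100))
```

    Because the input can be a generator, the full list of scores never has to be held in memory at once.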

    Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
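
    The pop-then-push step described above can be sketched in Python, the language of the code in this answer, with heapq.heappop and heapq.heappush standing in for the priority queue's pop() and push(). This only illustrates the control flow and is not C++ code; the function name top_n_pop_push is hypothetical.

```python
import heapq


def top_n_pop_push(items, n):
    """Keep the n largest items using only push, pop, and peek on a min-heap."""
    if n <= 0:
        return []

    heap = []  # min-heap: heap[0] plays the role of the queue's top()

    for item in items:
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # No heapreplace here: pop the smallest, then push the new value.
            heapq.heappop(heap)
            heapq.heappush(heap, item)

    return sorted(heap)
```

    Each replacement now costs two O(log n) heap operations instead of one, so the asymptotic running time is unchanged.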
