I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions) so sorting
You can do this in O(n) time for a fixed result size (more precisely O(n log k) for the top k elements), without sorting the full list, by using a heap:
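
    #!/usr/bin/python
    import heapq
    import random

    def top_n(l, n):
        """Return the n largest elements of iterable l, in ascending order."""
        if n <= 0:
            return []
        heap = []        # becomes a min-heap once it holds n elements
        smallest = None  # smallest value currently kept, i.e. heap[0]
        for elem in l:
            if len(heap) < n:
                heap.append(elem)
                if len(heap) == n:
                    heapq.heapify(heap)
                    smallest = heap[0]
            else:
                if elem > smallest:
                    # Replace the smallest kept value with the new one.
                    heapq.heapreplace(heap, elem)
                    smallest = heap[0]
        # Only the n kept elements are sorted here, not the whole input.
        return sorted(heap)

    def random_ints(n):
        for i in range(0, n):
            yield random.randint(0, 10000)

    print(top_n(random_ints(1000000), 100))
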
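For comparison, the standard library's heapq.nlargest gives the same result in one call and accepts any iterable, including a generator. This is a minimal sketch, assuming the scores are plain comparable values. The helper names top_n_builtin and random_scores and the sample data are made up for illustration:

```python
import heapq
import random


def top_n_builtin(scores, n):
    """Return the n largest scores in ascending order (hypothetical helper)."""
    # heapq.nlargest returns its results largest-first; reverse them to
    # match the ascending order produced by sorted().
    return heapq.nlargest(n, scores)[::-1]


def random_scores(count, seed=0):
    """Yield `count` pseudo-random integer scores (illustrative data only)."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.randint(0, 10000)


if __name__ == "__main__":
    print(top_n_builtin(random_scores(100000), 10))
```

Because the input is consumed as a stream, the full list of scores never has to be held in memory at once.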
Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with the bash time builtin):
Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have an equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that, the algorithm translates quite cleanly from Python to C++.
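The pop-then-push substitution can be sketched in Python, mirroring what a std::priority_queue version would do with pop() and push(). This is only an illustration of the control flow, not C++ code, and the function name keep_top_n_pop_push is hypothetical:

```python
import heapq


def keep_top_n_pop_push(values, n):
    """Track the n largest values using separate pop and push steps."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest value kept so far
    for value in values:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            heapq.heappop(heap)          # drop the current smallest
            heapq.heappush(heap, value)  # then add the newly seen value
    return sorted(heap)


if __name__ == "__main__":
    print(keep_top_n_pop_push([4, 8, 1, 9, 3], 2))
```

The result is the same as with heapreplace. The only difference is that the heap is adjusted twice per replacement instead of once.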