After reading Guido's "Sorting a million 32-bit integers in 2MB of RAM using Python", I discovered the heapq module, but the concept is pretty abstract to me.
Compared to a self-balancing binary tree, a heap doesn't seem to gain you much if you just look at complexity: both give you O(log n) insertion and O(log n) removal of the smallest (or largest) element.
But whereas a binary tree generally needs each node to store pointers to its children, a heap packs its data tightly into an array. That lets you store much more data in a fixed amount of memory.
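As a minimal sketch of what that array looks like with heapq (the values here are arbitrary):

>>> import heapq
>>> heap = []
>>> for value in [5, 1, 8, 3, 2]:
...     heapq.heappush(heap, value)
...
>>> heap        # children of heap[i] live at heap[2*i+1] and heap[2*i+2]
[1, 2, 8, 5, 3]
>>> heap[0]     # the smallest item is always at index 0
1

The heap is just an ordinary Python list; positions in the list stand in for the child pointers, which is where the memory saving comes from.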
So for cases where you only need insertion and removal of the smallest (or largest) element, a heap is perfect: it can often use half as much memory as a self-balancing binary tree, and it is much easier to implement if you have to write one yourself. The standard use case is a priority queue.
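A small sketch of that insert/remove workflow (heapq implements a min-heap, so the smallest item comes out first; push negated values if you want the largest first):

>>> import heapq
>>> heap = []
>>> for value in [7, 2, 9, 4]:
...     heapq.heappush(heap, value)
...
>>> [heapq.heappop(heap) for _ in range(len(heap))]
[2, 4, 7, 9]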
I discovered this by accident while trying to work out how I could implement the Counter module myself in Python 2.6. Have a look at the implementation and usage of collections.Counter: its most_common() method is actually implemented with heapq.
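For example, in CPython most_common(n) delegates to heapq.nlargest; the second call below is roughly what it does internally:

>>> from collections import Counter
>>> from heapq import nlargest
>>> counts = Counter("banana")
>>> counts.most_common(2)
[('a', 3), ('n', 2)]
>>> nlargest(2, counts.items(), key=lambda kv: kv[1])
[('a', 3), ('n', 2)]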
The heapq module is commonly used to implement priority queues.
You see priority queues in event schedulers that are constantly adding new events and need a heap to efficiently locate the next scheduled event; a sketch of that pattern follows below.
The heapq docs include priority queue implementation notes which address the common use cases.
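Here is a small sketch of that scheduler pattern, following the docs' suggestion of adding a tie-breaking counter so that entries with equal times never compare the events themselves (the function names are just illustrative):

import heapq
import itertools

schedule = []                    # entries are (time, tie_breaker, event) tuples
counter = itertools.count()      # tie-breaker keeps equal-time events in FIFO order

def add_event(when, event):
    heapq.heappush(schedule, (when, next(counter), event))

def pop_next_event():
    when, _, event = heapq.heappop(schedule)
    return when, event

add_event(10.0, 'backup')
add_event(2.5, 'send email')
add_event(2.5, 'poll sensor')
print(pop_next_event())          # (2.5, 'send email')
print(pop_next_event())          # (2.5, 'poll sensor')
print(pop_next_event())          # (10.0, 'backup')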
In addition, heaps are great for implementing partial sorts. For example, heapq.nsmallest and heapq.nlargest can be much more memory efficient and do many fewer comparisons than a full sort followed by a slice:
>>> from heapq import nlargest
>>> from random import random
>>> nlargest(5, (random() for i in xrange(1000000)))
[0.9999995650034837, 0.9999985756262746, 0.9999971934450994, 0.9999960394998497, 0.9999949126363714]
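Both functions also accept a key argument, so you can pick the top or bottom N by some field without sorting everything; for example (the record data below is made up):

>>> from heapq import nsmallest
>>> students = [('alice', 88), ('bob', 72), ('carol', 95), ('dave', 65)]
>>> nsmallest(2, students, key=lambda s: s[1])   # two lowest scores
[('dave', 65), ('bob', 72)]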