I have an information retrieval application that creates bit arrays on the order of 10s of million bits. The number of \"set\" bits in the array varies widely, from all clear to
I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.