I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to all set.
Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect the first N hits in a conventional bit array and choose one of three methods, based on the symmetry of this sample.
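For illustration, here's a minimal sketch of that dispatch in Python. The cutoff values and method names are hypothetical placeholders, to be replaced by the tuning described below:

```python
ASYMMETRIC_CUTOFF = 0.05   # placeholder threshold; tune against profiling results
MODERATE_CUTOFF = 0.25     # placeholder threshold; tune against profiling results

def choose_method(set_in_sample, sample_bits):
    """Pick a representation from the set-bit density of the sampled prefix."""
    p = set_in_sample / sample_bits
    skew = min(p, 1.0 - p)          # 0.5 = perfectly symmetric, 0 = fully asymmetric
    if skew < ASYMMETRIC_CUTOFF:
        return "index list"          # very asymmetric: list positions of the minority bit
    if skew < MODERATE_CUTOFF:
        return "huffman runs"        # moderate: Huffman-coded run lengths
    return "bit array"               # near-symmetric: plain bit array
```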
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time each algorithm requires balanced against the space it needs, with the relative weight of time versus space as an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, which I'll profile with testing, and I'll also test all three methods to determine the time requirements of my implementation.
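On the space side of that balance, information theory already gives the shape of the curve: for a bit array of n bits with set-bit density p, no Huffman-style code can go below n·H(p) bits, where H is the binary entropy function. A back-of-the-envelope comparison of the three candidates, with the index-list cost as a rough assumption (one log2(n)-bit position per minority bit):

```python
import math

def binary_entropy(p):
    """H(p) in bits: information content per bit at set-bit density p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def space_estimates(n_bits, p):
    """Rough space costs (in bits) of the three candidate representations."""
    minority = min(p, 1 - p)
    return {
        "bit array": n_bits,                                  # flat cost, independent of p
        "index list": minority * n_bits * math.log2(n_bits),  # one index per minority bit
        "huffman (lower bound)": n_bits * binary_entropy(p),  # entropy limit on coded output
    }
```

Plugging in a few densities shows why the middle region exists: at p = 0.5 the entropy bound equals the plain bit array, while at small p the index list and the coded form both shrink toward zero.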
It's possible (and actually I'm hoping) that the middle compression method will always beat the list, the bit array, or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry; then I could simplify the system down to just two methods.
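One concrete family of codes adapted to higher or lower symmetry is Golomb coding of the run lengths: if bits are set independently with density p, the run lengths are geometrically distributed, and Golomb codes are the optimal (Huffman-equivalent) prefix codes for a geometric distribution, with a single parameter M that tracks the symmetry. A sketch, using the usual approximation for the parameter (this is a stand-in for whatever adapted code set is eventually chosen, not a committed design):

```python
import math

def golomb_parameter(p):
    """Near-optimal Golomb parameter M for run lengths at set-bit density p."""
    if not 0.0 < p < 1.0:
        raise ValueError("density must be strictly between 0 and 1")
    return max(1, math.ceil(-1.0 / math.log2(1.0 - p)))

def golomb_encode(run_length, m):
    """Encode one run length: unary quotient, then truncated-binary remainder."""
    q, r = divmod(run_length, m)
    bits = "1" * q + "0"                 # unary part: q ones, terminating zero
    b = m.bit_length()
    if m == 1:
        return bits                      # no remainder bits needed
    if (1 << b) == 2 * m:                # m is a power of two: plain binary remainder
        return bits + format(r, f"0{b - 1}b")
    cutoff = (1 << b) - m                # truncated binary for non-power-of-two m
    if r < cutoff:
        return bits + format(r, f"0{b - 1}b")
    return bits + format(r + cutoff, f"0{b}b")
```

As p moves toward 0.5, M shrinks toward 1 and the code degenerates toward unary per run; as p gets small, M grows and long clear runs are covered cheaply, which is exactly the "adapted for higher or lower symmetry" behavior.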