What are some alternatives to a bit array?

夕颜 2021-02-06 07:05

I have an information retrieval application that creates bit arrays on the order of tens of millions of bits. The number of "set" bits in the array varies widely, from all clear to

7 answers
  • 2021-02-06 07:13

    Quick combinatoric proof that you can't really save much space:

    Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n * sqrt(2 / (pi * n)). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2 (n choose n/2) bits, which is about n - (1/2) log_2(n) bits.

    That's not much of a memory saving. For example, if you're working with n = 2^20 (1 megabit), then you can only save about 10 bits. It's just not worth it.

    Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.
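The estimate above can be checked numerically. The sketch below is an illustration, not part of the original answer: `min_bits` is a hypothetical helper that evaluates log_2 (n choose k) through `math.lgamma`, and the 1% density in the second case is an assumed example of data that is far from uniform.

```python
from math import lgamma, log

def min_bits(n: int, k: int) -> float:
    """log2 of (n choose k): bits needed to name one k-subset of n positions,
    assuming every such subset is equally likely."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

n = 2 ** 20
dense_saving = n - min_bits(n, n // 2)   # half of the bits set
sparse_bits = min_bits(n, n // 100)      # assumed case: about 1% of the bits set

print(f"saving with n/2 bits set: {dense_saving:.2f} bits")
print(f"bound with 1% of bits set: {sparse_bits:.0f} of {n} bits")
```

With half the bits set, the saving stays near (1/2) log_2(n). With a skewed density the bound drops to a small fraction of n, which is the kind of structure a compressed representation can exploit.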

  • 2021-02-06 07:19

    Straightforward lossless compression is the way to go. To make it searchable, you will have to compress relatively small blocks and create an index into an array of the blocks. This index can contain the bit offset of the starting bit in each block.
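A minimal sketch of that block-plus-index layout, assuming the bit array is packed into `bytes` and using `zlib` as a stand-in for whichever lossless compressor is chosen. The names `compress_blocks`, `get_bit` and the block size are hypothetical. Because every block covers the same number of uncompressed bits, the starting bit of block i is implicit, and the index here records where each compressed block begins.

```python
import zlib

BLOCK_BYTES = 4096  # assumed block size: 32,768 bits per block

def compress_blocks(bits: bytes, block_bytes: int = BLOCK_BYTES):
    """Compress a packed bit array block by block.

    Returns (blob, offsets). blob holds the compressed blocks back to back;
    offsets[i] is the byte position in blob where block i starts, and the
    final entry marks the end of the last block.
    """
    chunks = []
    offsets = [0]
    for start in range(0, len(bits), block_bytes):
        chunk = zlib.compress(bits[start:start + block_bytes])
        chunks.append(chunk)
        offsets.append(offsets[-1] + len(chunk))
    return b"".join(chunks), offsets

def get_bit(blob: bytes, offsets, i: int, block_bytes: int = BLOCK_BYTES) -> int:
    """Read bit i by decompressing only the block that contains it."""
    byte_index, bit_in_byte = divmod(i, 8)
    block, byte_in_block = divmod(byte_index, block_bytes)
    raw = zlib.decompress(blob[offsets[block]:offsets[block + 1]])
    return (raw[byte_in_block] >> bit_in_byte) & 1

# Usage: a mostly clear array of 800,000 bits with three bits set.
data = bytearray(100_000)
for position in (5, 70_001, 799_999):
    data[position // 8] |= 1 << (position % 8)
blob, offsets = compress_blocks(bytes(data))
```

Smaller blocks make a single lookup cheaper but usually compress worse, so the block size is a tuning parameter rather than a fixed choice.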

  • 2021-02-06 07:21

    I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.
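To illustrate the point about small alphabets, the sketch below compares the expected Huffman code length with the entropy of a biased bit source. It does not implement a range coder; it only shows the gap that a range or arithmetic coder can close. The function names and the bias p = 0.05 are assumptions chosen for the example.

```python
import heapq
from itertools import product
from math import log2

def binary_entropy(p: float) -> float:
    """Entropy in bits of one source bit that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def huffman_bits_per_source_bit(p: float, k: int) -> float:
    """Expected Huffman code length per source bit when k bits form one symbol."""
    # Probability of every k-bit group under independent bits.
    weights = [p ** sum(g) * (1 - p) ** (k - sum(g))
               for g in product((0, 1), repeat=k)]
    heapq.heapify(weights)
    expected = 0.0
    # The expected code length equals the sum of all merged node weights.
    while len(weights) > 1:
        a = heapq.heappop(weights)
        b = heapq.heappop(weights)
        expected += a + b
        heapq.heappush(weights, a + b)
    return expected / k

p = 0.05  # assumed bias
print("entropy bound:         ", binary_entropy(p))
print("Huffman, 1-bit symbols:", huffman_bits_per_source_bit(p, 1))
print("Huffman, 8-bit symbols:", huffman_bits_per_source_bit(p, 8))
```

With one-bit symbols Huffman spends a full bit per source bit regardless of the bias. Grouping bits narrows the gap to the entropy bound, but no group of k bits can cost less than one code bit, so the rate never falls below 1/k bits per source bit.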

  • 2021-02-06 07:34

    Maybe too late for you, but there is a very fast and memory-efficient library for sparse bit arrays (lossless) and other data types, based on tries. Look at Judy arrays.

  • 2021-02-06 07:34

    One more compression thought:

    If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before an entropy coder such as Huffman. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress, though there are almost certainly shortcuts to be had. If there's any sequential structure to your data at all, this should really help the Huffman encoding out.

    You could also apply that idea to one block at a time to keep the time and memory usage more practical. Working one block at a time could also let you keep most of the data structure compressed if you're reading or writing sequentially.
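For reference, the naive form of the transform can be sketched as follows. This is the textbook quadratic-memory version, not an optimized one; the `$` sentinel and the representation of the bit array as a string of `0` and `1` characters are assumptions made for readability.

```python
SENTINEL = "$"  # assumed end marker: absent from the input and sorts before "0" and "1"

def bwt(text: str) -> str:
    """Burrows-Wheeler transform: last column of the sorted rotation table."""
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last_column: str) -> str:
    """Rebuild the rotation table one column at a time, then pick the original row."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]

bits = "0000000100000001000000010000000100000001"
transformed = bwt(bits)
assert inverse_bwt(transformed) == bits
```

The transform itself does not shrink anything. It tends to group identical symbols together, which is what helps the coding stage that follows it.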

  • 2021-02-06 07:37

    Unless the data is truly random and has a symmetric 1/0 distribution, this simply becomes a lossless data compression problem, and it is very analogous to the CCITT Group 3 compression used for black-and-white (i.e. binary) fax images. CCITT Group 3 uses a Huffman coding scheme applied to run lengths. Fax uses a fixed set of Huffman codes, but you can generate a set of codes specific to your data set to improve the compression ratio.

    As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably build a binary search tree index to various offset points in the array, which would let you get close to the desired location and then walk in from there.

    Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.

    Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress them by any significant amount using any scheme.
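A rough sketch of the data-specific variant: run lengths are extracted from the bit array and a Huffman code is built from the run-length frequencies of that one array. This is not the Group 3 format itself (which uses fixed code tables); all function names are hypothetical, and the code strings are kept as text for clarity rather than packed into bytes.

```python
import heapq
from collections import Counter
from itertools import groupby

def run_lengths(bits):
    """Alternating run lengths, starting with a (possibly empty) run of 0s."""
    runs, expected = [], 0
    for value, group in groupby(bits):
        if value != expected:      # only happens when the array starts with a 1
            runs.append(0)
        runs.append(len(list(group)))
        expected = 1 - value
    return runs

def huffman_codes(freqs):
    """Map each symbol to a prefix-free code string built from its frequency."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (weight, unique tiebreak, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(bits):
    runs = run_lengths(bits)
    codes = huffman_codes(Counter(runs))
    return "".join(codes[r] for r in runs), codes

def decode(encoded, codes):
    inverse = {code: sym for sym, code in codes.items()}
    bits, value, current = [], 0, ""
    for ch in encoded:
        current += ch
        if current in inverse:     # a complete code word: emit that run
            bits.extend([value] * inverse[current])
            value = 1 - value
            current = ""
    return bits

sample = [0] * 1000 + [1] * 3 + [0] * 1000 + [1] * 3 + [0] * 500
encoded, codes = encode(sample)
assert decode(encoded, codes) == sample
```

The code table has to be stored alongside the encoded bits, so the real saving is smaller than the length of `encoded` suggests, and decoding is sequential, as the answer notes.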
