What is the difference between `bitCount()` and `bitLength()` of a `BigInteger`

青春惊慌失措 2021-01-11 19:19

The descriptions of bitCount() and bitLength() are rather cryptic:

public int bitCount()

Returns the number of bits in the two's complement representation of this BigInteger that differ from its sign bit. This method is useful when implementing bit-vector style sets atop BigIntegers.

2 Answers
  • 2021-01-11 20:12

    A quick demonstration:

    import java.math.BigInteger;

    public class BitDemo {
        public static void main(String[] args) {
            BigInteger b = BigInteger.valueOf(0x12345L);
            System.out.println("b = " + b.toString(2));              // binary representation
            System.out.println("bitCount(b) = " + b.bitCount());     // number of set bits
            System.out.println("bitLength(b) = " + b.bitLength());   // length of that binary string
        }
    }
    

    prints

    b = 10010001101000101
    bitCount(b) = 7
    bitLength(b) = 17

    So, for positive integers:

    bitCount() returns the number of set bits in the number.

    bitLength() returns the position of the highest set bit plus one, i.e. the length of the binary representation of the number (floor(log2(n)) + 1 for a positive n); a small sketch of that relationship follows.
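
    A minimal sketch of that log2 relationship, reusing the 0x12345 value from the demonstration above (the class name LogRelation is just for illustration):

    import java.math.BigInteger;

    public class LogRelation {
        public static void main(String[] args) {
            // For a positive n, bitLength() == floor(log2(n)) + 1.
            BigInteger n = BigInteger.valueOf(0x12345L);                    // 74565
            int fromLog = (int) Math.floor(Math.log(74565) / Math.log(2)) + 1;
            System.out.println(n.bitLength() + " == " + fromLog);          // prints "17 == 17"
        }
    }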

  • 2021-01-11 20:13

    A third basic function deserves a mention alongside these two:

    • bitCount() is useful to find the cardinality of a set of integers;
    • bitLength() is useful to find the largest integer that is a member of this set;
    • getLowestSetBit() is still needed to find the smallest member of this set (it is also what you need to implement fast iterators over bitsets); see the sketch after this list.
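
    A minimal sketch of these three operations, using a BigInteger directly as a set of small non-negative integers (the class name BitSetOps is just for illustration):

    import java.math.BigInteger;

    public class BitSetOps {
        public static void main(String[] args) {
            // Treat a BigInteger as a set of small non-negative integers:
            // integer i is a member <=> bit i of the BigInteger is set.
            BigInteger set = BigInteger.ZERO.setBit(3).setBit(17).setBit(64);

            System.out.println("cardinality = " + set.bitCount());         // 3
            System.out.println("largest     = " + (set.bitLength() - 1));  // 64
            System.out.println("smallest    = " + set.getLowestSetBit());  // 3

            // Fast iteration over members: repeatedly take and clear the lowest set bit.
            for (BigInteger s = set; s.signum() != 0; ) {
                int member = s.getLowestSetBit();
                System.out.println("member: " + member);
                s = s.clearBit(member);
            }
        }
    }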

    There are efficient ways (the word-level primitives are sketched after this list) to:

    • reduce a very large bitset to its bitCount() without having to shift each stored word (e.g. each 64-bit word) one position at a time in a slow loop over its 64 bits. The count for a word needs no loop at all and can be computed with a small, bounded number of arithmetic operations on 64-bit values (with the additional benefits that there is no loop condition to test, parallelism is possible, and fewer than 64 operations are needed per 64-bit word), so the cost is O(1) time per word;
    • compute the bitLength(): you just need the highest used index in the array of words, and then a few arithmetic operations on the single word stored at that index; on a 64-bit word at most 8 such operations are sufficient, so the cost is O(1) time;
    • but for getLowestSetBit() you still have to scan the low-order words for as long as they are all zero, then do a binary search for the lowest "bit-splitting" position within the first non-zero word. Because that word sits at an unknown position, parallelization is difficult and the cost is O(N) time, where N is the bitLength() of the bitset. I also wonder whether the costly tests and branches on that first non-zero word can be replaced by pure arithmetic, so that full parallelism could give an answer in O(1) time for this last word.
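
    A minimal sketch of these word-level primitives, assuming the bitset is stored as a plain long[] of 64-bit words; the per-word operations (popcount, leading/trailing zero counts) are branch-free JDK intrinsics, and only getLowestSetBit() still needs to scan past the all-zero words (the class name WordOps and the sample values are just for illustration):

    public class WordOps {
        static int bitCount(long[] words) {
            int count = 0;
            for (long w : words) {
                count += Long.bitCount(w);          // branch-free popcount, O(1) per word
            }
            return count;
        }

        static int bitLength(long[] words) {
            for (int i = words.length - 1; i >= 0; i--) {
                if (words[i] != 0) {
                    // highest used word index, plus the bit length of that single word
                    return 64 * i + (64 - Long.numberOfLeadingZeros(words[i]));
                }
            }
            return 0;
        }

        static int lowestSetBit(long[] words) {
            for (int i = 0; i < words.length; i++) {
                if (words[i] != 0) {
                    // still a linear scan over the leading zero words, as discussed above
                    return 64 * i + Long.numberOfTrailingZeros(words[i]);
                }
            }
            return -1;                               // empty set
        }

        public static void main(String[] args) {
            long[] set = new long[4];
            set[0] = 1L << 3;
            set[2] = 1L << 10;
            System.out.println(bitCount(set));       // 2
            System.out.println(bitLength(set));      // 139
            System.out.println(lowestSetBit(set));   // 3
        }
    }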

    In my opinion the third problem requires more efficient storage for bitsets than a flat array of words: we need a representation using a binary tree instead:

    Suppose you want to store 64 bits in a bitset

    • this set is equivalent to storing 2 subsets A and B of 32 bits each;
    • but instead of naively storing {A, B} you can store {A or B, (A or B) xor A, (A or B) xor B}, where "or" and "xor" are bit-for-bit operations (this basically adds 50% more data, by storing not just the two separate elements but their "sum" together with each element's difference from that sum);
    • you can apply this recursively for 128 bits, 256 bits and so on, and in fact you can avoid the 50% overhead at each step by summing more than two elements; using the "xor" differences instead of the elements themselves can accelerate some operations (not shown here), much like other compression schemes that are efficient on sparse sets;
    • this allows faster scanning of zeroes, because you can skip the all-zero regions very fast, in O(log2(N)) time, and only visit the words that contain non-zero bits: a whole region can be skipped whenever its summary (A or B) == 0 (a sketch of this skip-zeroes idea follows this list).
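
    A minimal sketch of this skip-zeroes idea, simplified to a single summary word (bit i of the summary is set iff word i is non-zero) rather than the exact tree encoding described above; the class name TwoLevelBitSet and the sizes are just for illustration:

    public class TwoLevelBitSet {
        private final long[] words = new long[64];   // 64 * 64 = 4096 bits
        private long summary;                        // bit i set <=> words[i] != 0

        void set(int bit) {
            int w = bit >>> 6;                       // word index
            words[w] |= 1L << bit;                   // shift counts are taken mod 64 in Java
            summary  |= 1L << w;
        }

        int lowestSetBit() {
            if (summary == 0) return -1;             // empty set
            int w = Long.numberOfTrailingZeros(summary);            // first non-zero word
            return (w << 6) + Long.numberOfTrailingZeros(words[w]); // skip 64 words at once
        }

        public static void main(String[] args) {
            TwoLevelBitSet s = new TwoLevelBitSet();
            s.set(2500);
            s.set(300);
            System.out.println(s.lowestSetBit());    // prints 300
        }
    }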

    Another common usage of bitsets is to let them represent their complement, but this is not easy when the number of integers that the set could have as members is very large (e.g. to represent a set of 64-bit integers): the bitset should then reserve at least one bit to indicate that it does NOT directly store the integers that are members of the set, but instead stores only the integers that are NOT members of the set (a minimal sketch of this "complement flag" follows).
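
    A minimal sketch of this "complement flag" idea on top of java.util.BitSet (the class name ComplementableSet is just for illustration):

    import java.util.BitSet;

    public class ComplementableSet {
        // One reserved flag says whether `bits` stores the members or the NON-members.
        private final BitSet bits = new BitSet();
        private boolean complemented;

        void complement()       { complemented = !complemented; }
        void add(int value)     { bits.set(value, !complemented); } // clear it if we store non-members
        boolean contains(int x) { return bits.get(x) ^ complemented; }

        public static void main(String[] args) {
            ComplementableSet s = new ComplementableSet();
            s.add(7);
            s.complement();                        // s now represents "everything except {7}"
            System.out.println(s.contains(7));     // false
            System.out.println(s.contains(42));    // true
        }
    }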

    An efficient tree-like representation of the bitset should also allow each node in the binary tree to choose whether it stores the members or the non-members, depending on the cardinality of members in each subrange (each subrange represents the subset of all integers between k and k + 2^n - 1, where k is the node's number in the binary tree; each node stores a single word of n bits, one of which records whether the word contains members or non-members).

    There is an efficient way to store binary trees in a flat indexed array, provided the tree is dense enough that few words are entirely 0 or entirely 1. If that is not the case (for very "sparse" sets), you need something else using pointers, like a B-tree, where each page of the B-tree is either a flat "dense" range or an ordered index of subtrees: you store the flat dense ranges in leaf nodes, which can be allocated in one flat array, and you store the other nodes separately in another store that can also be an array. Instead of a pointer from one node to another for a sub-branch of the B-tree, you use an index into that array; the index itself can carry one bit indicating whether it points to another page of branches or to a leaf node. (A tiny illustration of the flat-array layout follows.)
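
    A tiny illustration of the flat-array (heap-style) tree layout mentioned above, where node i's children live at indices 2*i + 1 and 2*i + 2 so no pointers are needed (the class name FlatTree is just for illustration):

    public class FlatTree {
        public static void main(String[] args) {
            long[] tree = new long[127];           // a complete binary tree of depth 6
            int node  = 5;
            int left  = 2 * node + 1;              // left child  -> index 11
            int right = 2 * node + 2;              // right child -> index 12
            System.out.println(left + " " + right + " (of " + tree.length + " nodes)");
        }
    }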

    But the current default implementation of bitsets in the Java collections does not use these techniques, so BitSets are still not efficient enough to store very sparse sets of large integers. You need your own library to reduce the storage requirement while still allowing fast lookups, in O(log2(N)) time, to determine whether an integer is a member of the set represented by such an optimized bitset.

    But anyway the default Java implementation is sufficient if you just need bitCount() and bitLength() and your bitsets represent dense sets of small integers (for a set of 16-bit integers, a naive approach storing 64K bits, i.e. at most 8 KB of memory, is generally enough); see the java.util.BitSet sketch below.
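
    For that dense, small-integer case, java.util.BitSet already provides the three operations discussed here under different names (a minimal sketch; the class name DenseSmallSet is just for illustration):

    import java.util.BitSet;

    public class DenseSmallSet {
        public static void main(String[] args) {
            BitSet set = new BitSet(1 << 16);        // room for every 16-bit value, at most ~8 KB
            set.set(3);
            set.set(40000);

            System.out.println(set.cardinality());   // 2     (analogue of bitCount())
            System.out.println(set.length());        // 40001 (highest set bit + 1, like bitLength())
            System.out.println(set.nextSetBit(0));   // 3     (analogue of getLowestSetBit())
        }
    }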

    For very sparse sets of large integers (e.g. no more than one set bit per range of 128 bits), it will always be more efficient to just store a sorted array of integer values, or a hash table if the bitset would set no more than one bit per range of 32 bits; you can still add an extra bit to these structures to store the "complement" flag (a minimal sorted-array sketch follows).
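
    A minimal sketch of the sorted-array alternative (the class name SparseSet is just for illustration):

    import java.util.Arrays;

    public class SparseSet {
        // A very sparse set of large integers kept as a sorted array of values.
        private final long[] members;

        SparseSet(long... values) {
            members = values.clone();
            Arrays.sort(members);
        }

        boolean contains(long value) {
            return Arrays.binarySearch(members, value) >= 0;   // O(log N) membership test
        }

        public static void main(String[] args) {
            SparseSet s = new SparseSet(1_000_000_007L, 42L, 9_999_999_999L);
            System.out.println(s.contains(42L));   // true
            System.out.println(s.contains(43L));   // false
        }
    }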

    But I have not found getLowestSetBit() to be efficient enough: the BigInteger package still cannot support very sparse bitsets without huge memory costs, even though BigInteger can easily represent the "complement" bit just like a "sign bit", through its signum() and subtract() methods, which are efficient.

    Very large and very sparse bitsets are needed, for example, for some well-known operations such as searches in very large databases of RDF tuples in a knowledge base, each tuple being indexed by a very large GUID (represented by 128-bit integers): you need to be able to perform binary operations like unions, differences, and complements.
