How do I represent binary numbers in C++ (used for Huffman encoder)?

Question

I am writing my own Huffman encoder. So far I have built the Huffman tree by using a min-heap to pop off the two lowest-frequency nodes, create a node that links to them, and push the new node back on (lather, rinse, repeat until only one node remains).

So now I have created the tree, but I need to use this tree to assign codes to each character. My problem is I don't know how to store the binary representation of a number in C++. I remember reading that unsigned char is the standard for a byte, but I am unsure.

I know I have to recursively traverse the tree, and whenever I hit a leaf node I must assign the corresponding character whatever code currently represents the path.

Here is what I have so far:

void traverseFullTree(huffmanNode* root, unsigned char curCode, unsigned char &codeBook){

    if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
        codeBook[(int)root->character] = curCode;
    }else{ //root has children, recurse into them with the currentCodes updated for right and left branch
        traverseFullTree(root->leftChild, **CURRENT CODE SHIFTED WITH A 0**, codeBook );
        traverseFullTree(root->rightChild, **CURRENT CODE SHIFTED WITH A 1**, codeBook);
    }

    return 0;
}

CodeBook is my array that has a place for the codes of up to 256 characters (for each possible character in ASCII), but I am only going to actually assign codes to values that appear in the tree.

I am not sure if this is the correct way to traverse my Huffman tree, but it is what immediately seems like it would work (though I haven't tested it). Also, how do I call the traverse function on the root of the whole tree with no zeros or ones (the very top of the tree)?

Should I be using a string instead and appending to the string either a zero or a 1?

Answer 1:

Since computers are binary ... ALL numbers in C/C++ are already in binary format.

int a = 10;

The variable a is already stored as a binary number.

What you want to look at is bit manipulation, operators such as & | << >>.

With the Huffman encoding, you would pack the data down into an array of bytes.

It's been a long time since I've written C, so this is "off-the-cuff" pseudo-code...

Totally untested -- but should give you the right idea.

unsigned char buffer[1000]; // The buffer we are writing to. Calculate the size ahead of time, or grow it dynamically as you go with malloc/realloc.

void set_bit(int bit_position) {
  int byte = bit_position / 8; // index of the byte that holds the bit
  int bit = bit_position % 8;  // position of the bit within that byte

  // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
  buffer[byte] |= 1 << bit;
}

void clear_bit(int bit_position) {
  int byte = bit_position / 8; // index of the byte that holds the bit
  int bit = bit_position % 8;  // position of the bit within that byte

  // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
  buffer[byte] &= ~(1 << bit);
}


// and in your loop, you'd just call these functions to set the bit number.
set_bit(0);
clear_bit(1);
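
For a C++ take on the same idea, here is a minimal sketch that assumes a growable std::vector<unsigned char> instead of a fixed array. The BitBuffer type and its ensure/get_bit members are hypothetical names made up for this illustration; bits are numbered from the least significant bit of each byte, matching the 1 << bit convention above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: a growable bit buffer. Bit i lives in byte i / 8,
// at position i % 8 counted from the least significant bit.
struct BitBuffer {
    std::vector<unsigned char> bytes;

    // Grow the buffer (zero-filled) so that bit_position is addressable.
    void ensure(std::size_t bit_position) {
        std::size_t needed = bit_position / 8 + 1;
        if (bytes.size() < needed) bytes.resize(needed, 0);
    }

    void set_bit(std::size_t bit_position) {
        ensure(bit_position);
        bytes[bit_position / 8] |= static_cast<unsigned char>(1u << (bit_position % 8));
    }

    void clear_bit(std::size_t bit_position) {
        ensure(bit_position);
        bytes[bit_position / 8] &= static_cast<unsigned char>(~(1u << (bit_position % 8)));
    }

    // Bits beyond the end of the buffer read as 0.
    bool get_bit(std::size_t bit_position) const {
        if (bit_position / 8 >= bytes.size()) return false;
        return ((bytes[bit_position / 8] >> (bit_position % 8)) & 1u) != 0;
    }
};
```

Letting the vector grow on demand avoids having to work out the output size before encoding starts.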

Answer 2:

Since curCode holds only zeros and ones, std::bitset might suit your needs. It is convenient and saves memory. See: http://www.sgi.com/tech/stl/bitset.html

Only a small change to your code is needed:

#include <bitset>
#include <cstddef>

// Note: a bitset alone does not record how many of its bits belong to the code,
// so the code length still has to be tracked separately.
template <std::size_t N>
void traverseFullTree(huffmanNode* root, std::bitset<N> curCode, std::bitset<N> codeBook[]){

    if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
        codeBook[(unsigned char)root->character] = curCode;
    }else{ //root has children, recurse into them with the current code shifted left and a 0 or 1 appended
        traverseFullTree(root->leftChild, curCode << 1, codeBook);
        traverseFullTree(root->rightChild, (curCode << 1) | std::bitset<N>(1), codeBook);
    }
}

Answer 3:

how to store the binary representation of a number in C++

You can simply use std::bitset:

#include <iostream>
#include <bitset>

int main() {
  int a = 42;
  std::bitset<(sizeof(int) * 8)> bs(a);

  std::cout << bs.to_string() << "\n";
  std::cout << bs.to_ulong() << "\n";
  return (0);
}

As you can see, bitsets also provide methods for converting to other types, as well as the handy [] operator.
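
To illustrate the [] operator in the Huffman setting, here is a small sketch. The helpers appendBit and codeToString are hypothetical names, and the fixed width of 32 bits is an assumption. Note that the number of valid bits has to be passed around separately, since a bitset does not know how many of its bits belong to the code.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical helper: append one bit at the low end of the code.
std::bitset<32> appendBit(std::bitset<32> code, bool bit) {
    code <<= 1;
    code[0] = bit;   // operator[] gives writable access to a single bit
    return code;
}

// Hypothetical helper: render the lowest `length` bits, most significant first.
std::string codeToString(const std::bitset<32>& code, std::size_t length) {
    std::string out;
    for (std::size_t i = length; i > 0; --i) {
        out += code[i - 1] ? '1' : '0';
    }
    return out;
}
```

Appending 0, 1, 1 gives a bitset whose value is 3, but the code is only recovered as "011" when the length 3 is supplied as well.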

Answer 4:

Please don't use a string.

You can represent the codebook as two arrays of integers, one with the bit-lengths of the codes, one with the codes themselves. There is one issue with that: what if a code is longer than an integer? The solution is to just not make that happen. Having a short-ish maximum codelength (say 15) is a trick used in most practical uses of Huffman coding, for various reasons.
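
A minimal sketch of the two-array codebook, assuming no code is longer than 32 bits. The huffmanNode layout is a guess (the question does not show the real definition), and assignCodes is a hypothetical name. The traversal starts at the root with code 0 and length 0, which is one way to express "no zeros or ones yet".

```cpp
#include <cstdint>

// Assumed node layout; the question does not show the real definition.
struct huffmanNode {
    unsigned char character;
    huffmanNode* leftChild;
    huffmanNode* rightChild;
};

// Fills codes[c] and lengths[c] for every leaf character c.
// Assumes no code is longer than 32 bits.
void assignCodes(const huffmanNode* node, std::uint32_t curCode, int curLength,
                 std::uint32_t codes[256], int lengths[256]) {
    if (node == nullptr) return;
    if (node->leftChild == nullptr && node->rightChild == nullptr) {
        // Leaf: the path taken so far is this character's code.
        codes[node->character] = curCode;
        lengths[node->character] = curLength;
        return;
    }
    // Going left appends a 0 bit, going right appends a 1 bit.
    assignCodes(node->leftChild, curCode << 1, curLength + 1, codes, lengths);
    assignCodes(node->rightChild, (curCode << 1) | 1u, curLength + 1, codes, lengths);
}
```

A call would look like assignCodes(root, 0, 0, codes, lengths) with both arrays zero-initialized. One edge case to handle separately: a tree that consists of a single leaf gets a code of length 0.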

I recommend using canonical Huffman codes, and that slightly simplifies your tree traversal: you'd only need the lengths, so you don't have to keep track of the current code. With canonical Huffman codes, you can generate the codes easily from the lengths.
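
As a sketch of generating codes from lengths: the answer does not say which canonical numbering it has in mind, so the version below uses the one described in the DEFLATE specification (RFC 1951), where codes are handed out in order of increasing length and in symbol order within each length. The function name canonicalCodes and the maxLength parameter are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assigns canonical codes given per-symbol code lengths (0 = symbol unused).
// Assumes the lengths form a valid prefix code and none exceeds maxLength.
std::vector<std::uint32_t> canonicalCodes(const std::vector<int>& lengths, int maxLength) {
    // Count how many symbols use each length.
    std::vector<std::uint32_t> countPerLength(maxLength + 1, 0);
    for (int len : lengths) {
        if (len > 0) ++countPerLength[len];
    }

    // First code of each length: (first code of the previous length plus the
    // number of codes of the previous length), shifted left by one bit.
    std::vector<std::uint32_t> nextCode(maxLength + 1, 0);
    std::uint32_t code = 0;
    for (int len = 1; len <= maxLength; ++len) {
        code = (code + countPerLength[len - 1]) << 1;
        nextCode[len] = code;
    }

    // Hand out consecutive codes within each length, in symbol order.
    std::vector<std::uint32_t> codes(lengths.size(), 0);
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        if (lengths[s] > 0) codes[s] = nextCode[lengths[s]]++;
    }
    return codes;
}
```

With the lengths 3, 3, 3, 3, 3, 2, 4, 4 (the example used in RFC 1951), this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111.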

If you are using canonical codes, you can let the codes be wider than integers, because the high bits would be zero anyway. However, it is still a good idea to limit the lengths. Having a short maximum length (well not too short, that would limit compression, but say around 16) enables you to use the simplest table-based decoding method, a simple single-level table.
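
The single-level table could look roughly like this. It is only a sketch: TableEntry, buildDecodeTable and decodeBits are hypothetical names, and the input is a std::vector<bool> purely to keep the example short (a real decoder would read packed bytes). The table has 2^maxLength entries, and every index whose leading bits equal a symbol's code maps to that symbol.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct TableEntry {
    int symbol;   // decoded symbol, or -1 if no code maps here
    int length;   // number of bits the code actually uses
};

// Builds a table with 2^maxLength entries from a prefix code.
std::vector<TableEntry> buildDecodeTable(const std::vector<std::uint32_t>& codes,
                                         const std::vector<int>& lengths,
                                         int maxLength) {
    std::vector<TableEntry> table(std::size_t(1) << maxLength, TableEntry{-1, 0});
    for (std::size_t s = 0; s < codes.size(); ++s) {
        int len = lengths[s];
        if (len == 0) continue;                    // unused symbol
        int pad = maxLength - len;                 // trailing bits that belong to later symbols
        std::size_t first = std::size_t(codes[s]) << pad;
        std::size_t count = std::size_t(1) << pad;
        for (std::size_t i = 0; i < count; ++i) {
            table[first + i] = TableEntry{static_cast<int>(s), len};
        }
    }
    return table;
}

// Decodes symbols until the bits run out or an invalid entry is hit.
std::vector<int> decodeBits(const std::vector<bool>& bits,
                            const std::vector<TableEntry>& table, int maxLength) {
    std::vector<int> out;
    std::size_t pos = 0;
    while (pos < bits.size()) {
        // Peek maxLength bits, padding with zeros past the end of the input.
        std::size_t index = 0;
        for (int i = 0; i < maxLength; ++i) {
            std::size_t p = pos + i;
            bool bit = (p < bits.size()) && bits[p];
            index = (index << 1) | (bit ? 1u : 0u);
        }
        const TableEntry& e = table[index];
        if (e.symbol < 0 || pos + e.length > bits.size()) break;  // invalid or truncated input
        out.push_back(e.symbol);
        pos += e.length;   // consume only the bits the code really used
    }
    return out;
}
```

Each symbol costs one table lookup regardless of its code length, which is why the table size, and therefore the maximum code length, has to stay small.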

Limiting code lengths to 25 or less also slightly simplifies encoding. It lets you use a 32-bit integer as a "buffer" and empty it byte by byte, without any special handling of the case where the buffer holds fewer than 8 bits but encoding the current symbol would overflow it. That case is entirely avoided: in the worst case there are 7 bits in the buffer and you try to encode a 25-bit symbol, which adds up to exactly 32 bits and works just fine.

Something like this (not tested in any way):

uint32_t buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbolCount; i++)
{
    int s = symbols[i];
    buffer <<= lengths[s];  // make room for the bits
    bufbits += lengths[s];  // buffer got longer
    buffer |= values[s];    // put in the bits corresponding to the symbol

    while (bufbits >= 8)    // as long as there is at least a byte in the buffer
    {
        bufbits -= 8;       // forget it's there
        writeByte((buffer >> bufbits) & 0xFF); // and save it
    }
}
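
Wrapped into a function, the loop above might look like the following sketch. Collecting the output in a std::vector<std::uint8_t> instead of calling writeByte, the name encodeSymbols, and the final flush step are assumptions added for this example; the loop above leaves any leftover bits in the buffer.

```cpp
#include <cstdint>
#include <vector>

// Packs the codes for `symbols` into bytes, most significant bit first.
// Assumes every length is between 1 and 25 and values[s] < 2^lengths[s].
std::vector<std::uint8_t> encodeSymbols(const std::vector<int>& symbols,
                                        const std::vector<std::uint32_t>& values,
                                        const std::vector<int>& lengths) {
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;
    int bufbits = 0;   // number of valid bits at the low end of buffer
    for (int s : symbols) {
        buffer = (buffer << lengths[s]) | values[s];  // make room, then add the code
        bufbits += lengths[s];
        while (bufbits >= 8) {                        // emit every complete byte
            bufbits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bufbits) & 0xFF));
        }
    }
    if (bufbits > 0) {
        // Flush the final partial byte, padding the low bits with zeros.
        out.push_back(static_cast<std::uint8_t>((buffer << (8 - bufbits)) & 0xFF));
    }
    return out;
}
```

Because the last byte is padded with zeros, a decoder needs some other way to know where the data ends, for example a stored symbol count or a dedicated end-of-stream symbol.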

Source: https://stackoverflow.com/questions/17904652/how-do-i-represent-binary-numbers-in-c-used-for-huffman-encoder

Tags

c++

binary-tree

bit-manipulation

huffman-code

tree-traversal