问题
I am implementing a LZW algorithm in C++.
The size of the dictionary is a user input, but the minimum is 256, so it should work with binary files. If it reaches the end of the dictionary it goes around to the index 0 and works up overwriting it from there.
For example, if i put in a alice in wonderland script and compress it with a dictionary size 512 i get this dictionary.
But i have a problem with decompression and the output dictionary from decompressing the compressed file looks like this.
And my code for decompressing looks like this
struct dictionary
{
vector<unsigned char> entry;
vector<bool> bits;
};
void decompress(dictionary dict[], vector<bool> file, int dictionarySize, int numberOfBits)
{
//in this example
//dictionarySize = 512, tells the max size of the dictionary, and goes back to 0 if it reaches 513
//numberOfBits = log2(512) = 9
//dictionary dict[] contains bits and strings (strings can be empty)
// dict[0] =
// entry = (unsigned char)0
// bits = (if numberOfBits = 9) 000000001
// dict[255] =
// entry = (unsigned char)255
// bits = (if numberOfBits = 9) 011111111
// so the next entry will be dict[next] (next is currently 256)
// dict[256] =
// entry = what gets added in the code below
// bits = 100000000
// all the bits are already set previously (dictionary size is int dictionarySize) so in this case all the bits from 0 to 511 are already set, entries are set from 0 to 255, so extended ASCII
vector<bool> currentCode;
vector<unsigned char> currentString;
vector<unsigned char> temp;
int next=256;
bool found=false;
for(int i=0;i<file.size();i+=numberOfBits)
{
for(int j=0;j<numberOfBits;j++)
{
currentCode.push_back(file[i+j]);
}
for(int j=0;j<dictionarySize;j++)
{
// when the currentCode (size numberOfBits) gets found in the dictionary
if(currentCode==dict[j].bits)
{
currentString = dict[j].entry;
// if the current string isnt empty, then it means it found the characted in the dictionary
if(!currentString.empty())
{
found = true;
}
}
}
//if the currentCode in the dictionary has a string value attached to it
if(found)
{
for(int j=0;j<currentString.size();j++)
{
cout<<currentString[j];
}
temp.push_back(currentString[0]);
// so it doesnt just push 1 character into the dictionary
// example, if first read character is 'r', it is already in the dictionary so it doesnt get added
if(temp.size()>1)
{
// if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
if(next>dictionarySize-1) //next > 512-1
{
next = 0;
}
dict[next].entry.clear();
dict[next].entry = temp;
next++;
}
//temp = currentString;
}
else
{
currentString = temp;
currentString.push_back(temp[0]);
for(int j=0;j<currentString.size();j++)
{
cout<<currentString[j];
}
// if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
if(next>dictionarySize-1)
{
next = 0;
}
dict[next].entry.clear();
dict[next].entry = currentString;
next++;
//break;
}
temp = currentString;
// currentCode gets cleared, and written into in the next iteration
currentCode.clear();
//cout<<endl;
found = false;
}
}
Im am currently stuck and dont know what to fix here to fix the output. I have also noticed, that if i put a dictionary big enough, so it doesnt go around the dictionary (it doesnt reach the end and begin again at 0) it works.
回答1:
start small
you are using files that is too much data to debug. Start small with strings. I took this nice example from Wikli:
Input: "abacdacacadaad" step input match output new_entry new_index a 0 b 1 c 2 d 3 1 abacdacacadaad a 0 ab 4 2 bacdacacadaad b 1 ba 5 3 acdacacadaad a 0 ac 6 4 cdacacadaad c 2 cd 7 5 dacacadaad d 3 da 8 6 acacadaad ac 6 aca 9 7 acadaad aca 9 acad 10 8 daad da 8 daa 11 9 ad a 0 ad 12 10 d d 3 Output: "0102369803"
So you can debug your code step by step with cross matching both input/output and dictionary contents. Once that is done correctly then you can do the same for decoding:
Input: "0102369803" step input output new_entry new_index a 0 b 1 c 2 d 3 1 0 a 2 1 b ab 4 3 0 a ba 5 4 2 c ac 6 5 3 d cd 7 6 6 ac da 8 7 9 aca aca 9 8 8 da acad 10 9 0 a daa 11 10 3 d ad 12 Output: "abacdacacadaad"
Only then move to files and clear dictionary handling.
bitstream
once you succesfully done the LZW on small alphabet you can try to use the full alphabet and bit encoding. You know the LZW stream can be encoded at any bitlength (not just 8/16/32/64 bits) which can greatly affect compression ratios (in respect to used data properties). So I would try to do univeral access to data at variable (or predefined bitlength).
Was a bit curious so I encoded a simple C++/VCL example for the compression:
//---------------------------------------------------------------------------
// LZW
const int LZW_bits=12; // encoded bitstream size
const int LZW_size=1<<LZW_bits; // dictinary size
// bitstream R/W
DWORD bitstream_tmp=0;
//---------------------------------------------------------------------------
// return LZW_bits from dat[adr,bit] and increment position (adr,bit)
DWORD bitstream_read(BYTE *dat,int siz,int &adr,int &bit,int bits)
{
DWORD a=0,m=(1<<bits)-1;
// save tmp if enough bits
if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
for (;;)
{
// insert byte
bitstream_tmp<<=8;
bitstream_tmp&=0xFFFFFF00;
bitstream_tmp|=dat[adr]&255;
adr++; bit+=8;
// save tmp if enough bits
if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
// end of data
if (adr>=siz) return 0;
}
}
//---------------------------------------------------------------------------
// write LZW_bits from a to dat[adr,bit] and increment position (adr,bit)
// return true if buffer is full
bool bitstream_write(BYTE *dat,int siz,int &adr,int &bit,int bits,DWORD a)
{
a<<=32-bits; // align to MSB
// save tmp if aligned
if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp )&255; adr++; bit-=8; }
// process all bits of a
for (;bits;bits--)
{
// insert bit
bitstream_tmp<<=1;
bitstream_tmp&=0xFFFFFFFE;
bitstream_tmp|=(a>>31)&1;
a<<=1; bit++;
// save tmp if aligned
if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp )&255; adr++; bit-=8; }
}
return (adr>=siz);
}
//---------------------------------------------------------------------------
bool str_compare(char *s0,int l0,char *s1,int l1)
{
if (l1<l0) return false;
for (;l0;l0--,s0++,s1++)
if (*s0!=*s1) return false;
return true;
}
//---------------------------------------------------------------------------
AnsiString LZW_encode(AnsiString raw)
{
AnsiString lzw="";
int i,j,k,l;
int adr,bit;
DWORD a;
const int siz=32; // bitstream buffer
BYTE buf[siz];
AnsiString dict[LZW_size]; // dictionary
int dicts=0; // actual size of dictionary
// init dictionary
for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts); // full 8bit binary alphabet
// for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts); // test alphabet "a,b,c,d"
l=raw.Length();
adr=0; bit=0;
for (i=0;i<l;)
{
i&=i;
// find match in dictionary
for (j=dicts-1;j>=0;j--)
if (str_compare(dict[j].c_str(),dict[j].Length(),raw.c_str()+i,l-i))
{
i+=dict[j].Length();
if (i<l) // add new entry in dictionary (if not end of input)
{
// clear dictionary if full
if (dicts>=LZW_size) dicts=256; // full 8bit binary alphabet
// if (dicts>=LZW_size) dicts=4; // test alphabet "a,b,c,d"
else{
dict[dicts]=dict[j]+AnsiString(raw[i+1]); // AnsiString index starts from 1 hence the +1
dicts++;
}
}
a=j; j=-1; break; // full binary output
// a='0'+j; j=-1; break; // test ASCII output
}
// store result to bitstream
if (bitstream_write(buf,siz,adr,bit,LZW_bits,a))
{
// append buf to lzw
k=lzw.Length();
lzw.SetLength(k+adr);
for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
// reset buf
adr=0;
}
}
if (bit)
{
// store the remainding bits with zeropad
bitstream_write(buf,siz,adr,bit,LZW_bits-bit,0);
}
if (adr)
{
// append buf to lzw
k=lzw.Length();
lzw.SetLength(k+adr);
for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
}
return lzw;
}
//---------------------------------------------------------------------------
AnsiString LZW_decode(AnsiString lzw)
{
AnsiString raw="";
int adr,bit,siz,ix;
DWORD a;
AnsiString dict[LZW_size]; // dictionary
int dicts=0; // actual size of dictionary
// init dictionary
for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts); // full 8bit binary alphabet
// for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts); // test alphabet "a,b,c,d"
siz=lzw.Length();
adr=0; bit=0; ix=-1;
for (adr=0;(adr<siz)||(bit>=LZW_bits);)
{
a=bitstream_read(lzw.c_str(),siz,adr,bit,LZW_bits);
// a-='0'; // test ASCII input
// clear dictionary if full
if (dicts>=LZW_size){ dicts=4; ix=-1; }
// new dictionary entry
if (ix>=0)
{
if (a>=dicts){ dict[dicts]=dict[ix]+AnsiString(dict[ix][1]); dicts++; }
else { dict[dicts]=dict[ix]+AnsiString(dict[a ][1]); dicts++; }
} ix=a;
// update decoded output
raw+=dict[a];
}
return raw;
}
//---------------------------------------------------------------------------
and output using // test ASCII input
lines:
txt="abacdacacadaad"
enc="0102369803"
dec="abacdacacadaad"
where AnsiString
is the only VCL stuff I used and its just self allocating string variable beware its indexes starts at 1
.
AnsiString s;
s[5] // character access (1 is first character)
s.Length() // returns size
s.c_str() // returns char*
s.SetLength(size) // resize
So just use any string you got ...
In case you do not have BYTE,DWORD
use unsigned char
and unsigned int
instead ...
Looks like its working for long texts too (bigger than dictionary and or bitstream buffer sizes). However beware that the clearing might be done in few different places of code but must be synchronized in both encoder/decoder otherwise after clearing the data would corrupt.
The example can use either just "a,b,c,d"
alphabet or full 8it one. Currently is set for 8bit. If you want to change it just un-rem the // test ASCII input
lines and rem out the // full 8bit binary alphabet
lines in the code.
To test crossing buffers and boundary you can play with:
const int LZW_bits=12; // encoded bitstream size
const int LZW_size=1<<LZW_bits; // dictinary size
and also with:
const int siz=32; // bitstream buffer
constants... The also affect performance so tweak to your liking.
Beware the bitstream_write
is not optimized and can be speed up considerably ...
Also in order to debug 4bit aligned coding I am using hex print of encoded data (hex string is twice as long as its ASCII version) like this (ignore the VCL stuff):
AnsiString txt="abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa",enc,dec,hex;
enc=LZW_encode(txt);
dec=LZW_decode(enc);
// convert to hex
hex=""; for (int i=1,l=enc.Length();i<=l;i++) hex+=AnsiString().sprintf("%02X",enc[i]);
mm_log->Lines->Add("\""+txt+"\"");
mm_log->Lines->Add("\""+hex+"\"");
mm_log->Lines->Add("\""+dec+"\"");
mm_log->Lines->Add(AnsiString().sprintf("ratio: %i%",(100*enc.Length()/dec.Length())));
and result:
"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
"06106206106306410210510406106410FFFFFF910A10706110FFFFFFD10E06206311110910FFFFFFE11410FFFFFFD0"
"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
ratio: 81%
来源:https://stackoverflow.com/questions/62086402/lzw-decompression