optimizing byte-pair encoding


Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of


9 answers

野趣味

2020-12-30 11:36

Yes, keep us posted.

guarantee?

BobMcGee gives good advice. However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true. I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.

If we limit ourselves to 7-bit ASCII text, we have at least 128 free byte values available (every value with the high bit set), so any file that repeats a byte pair enough times can be compressed at least a little by BPE. However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
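To make the first counterexample concrete, here is a minimal Python sketch. The helper names `free_bytes` and `most_common_pair` are made up for this illustration and are not taken from any BPE implementation. It builds a file well under 1kB that uses all 256 byte values and still contains a pair repeated at least 10 times, so a classic BPE compressor has no unused byte to assign to that pair.

```python
from collections import Counter

def free_bytes(data: bytes) -> list:
    """Byte values that never occur in data (candidates for pair codes)."""
    used = set(data)
    return [b for b in range(256) if b not in used]

def most_common_pair(data: bytes):
    """Return ((first, second), count) for the most frequent adjacent pair."""
    pairs = Counter(zip(data, data[1:]))
    return pairs.most_common(1)[0] if pairs else None

# All 256 byte values once, followed by the pair "ab" ten times.
data = bytes(range(256)) + b"ab" * 10

print(len(data))               # 276
print(free_bytes(data))        # []
print(most_common_pair(data))  # ((97, 98), 11)
```

The count is 11 rather than 10 because `a` and `b` are also adjacent inside `bytes(range(256))`.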

single pass

It seems this algorithm could be tweaked slightly to run in one pass. Assuming 7-bit ASCII plaintext: scan the input, recording every byte pair seen so far in an internal table (and keeping count of how many distinct pairs have been seen), and copy each byte to the output with its high bit clear. Whenever a pair repeats, emit a special byte that stands for that pair, with its high bit set so that literal bytes and pair codes cannot be confused. Also add those special bytes to the table of "pairs", so that the compressor can later emit another special byte that stands for a special byte plus a literal byte; the net effect is that the second special byte represents a triplet. As phkahler pointed out, that sounds practically the same as LZW.
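One way to read the single-pass idea is as LZW restricted to one-byte codes. The sketch below is my interpretation under that assumption, not a reference implementation: values 0..127 are literal bytes, values 128..255 are table entries of the form (earlier code, literal byte), and the table is simply frozen once all 128 codes are used. The function names `compress` and `decompress` are chosen for this example only.

```python
def compress(text: bytes) -> bytes:
    if any(b > 127 for b in text):
        raise ValueError("input must be 7-bit ASCII")
    table = {}        # (prefix code, literal byte) -> code in 128..255
    next_code = 128
    out = bytearray()
    w = None          # code for the longest match found so far
    for c in text:
        if w is None:
            w = c
        elif (w, c) in table:
            w = table[(w, c)]          # extend the current match
        else:
            out.append(w)
            if next_code < 256:        # table is frozen once it is full
                table[(w, c)] = next_code
                next_code += 1
            w = c
    if w is not None:
        out.append(w)
    return bytes(out)

def decompress(data: bytes) -> bytes:
    table = {}        # code in 128..255 -> expanded bytes
    next_code = 128
    out = bytearray()
    prev = None       # expansion of the previous code
    for code in data:
        if code < 128:
            entry = bytes([code])
        elif code in table:
            entry = table[code]
        elif code == next_code and prev is not None:
            entry = prev + prev[:1]    # code used right after it was defined
        else:
            raise ValueError("corrupt input")
        out += entry
        if prev is not None and next_code < 256:
            table[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out)

sample = b"abababababab"
packed = compress(sample)
print(len(sample), len(packed))
print(decompress(packed) == sample)
```

The decoder rebuilds the same table from the code stream, so no dictionary has to be stored in the output. That is the main practical difference from ordinary multi-pass BPE, which has to transmit its pair table.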

EDIT: Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.

Have you seen "SCZ - Simple Compression Utilities and Library"? SCZ appears to be a kind of byte pair encoder. SCZ apparently gives better compression than other byte pair compressors I've seen, because SCZ doesn't have the "no free bytes" limitation I mentioned above.

If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text), SCZ can do byte-pair compression, even when the text already includes all 256 bytes.

(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than being expanded as a byte pair. This allows some byte M in the compressed text to do double duty: the two bytes EM in the compressed text represent M in the plaintext, while the byte M without a preceding escape byte represents some byte pair BP in the plaintext. If BP occurs many more times than M in the plaintext, then the space saved by representing each BP as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
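A single round of the escape-byte idea can be sketched as follows. This is an illustration of the principle only and does not reproduce SCZ's actual file format; the function names and the parameters `esc`, `marker`, and `pair` are assumptions made for the example. A literal occurrence of the escape byte itself is also written as escape plus byte, so the decoder stays unambiguous.

```python
def encode_round(data: bytes, esc: int, marker: int, pair: tuple) -> bytes:
    """Replace each occurrence of pair with marker; escape literal esc/marker."""
    assert esc != marker
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
            out.append(marker)              # BP -> M
            i += 2
        elif data[i] in (esc, marker):
            out += bytes([esc, data[i]])    # literal M -> E M, literal E -> E E
            i += 1
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def decode_round(data: bytes, esc: int, marker: int, pair: tuple) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == esc:
            out.append(data[i + 1])         # escaped byte stands for itself
            i += 2
        elif data[i] == marker:
            out += bytes(pair)              # unescaped marker expands to the pair
            i += 1
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

# Every byte value is in use, yet the pair "ab" can still be compressed.
data = bytes(range(256)) + b"ab" * 40
packed = encode_round(data, esc=1, marker=0, pair=(97, 98))
print(len(data), len(packed))   # 336 297
print(decode_round(packed, esc=1, marker=0, pair=(97, 98)) == data)   # True
```

In this example the pair occurs 41 times while the marker and escape bytes occur once each, so the round saves 41 bytes and loses 2. A real compressor would also have to store `esc`, `marker`, and `pair` in a header and would choose them from byte frequencies.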

囚心锁ツ

2020-12-30 11:41

There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.

离开以前

2020-12-30 11:45

You can also optimize the dictionary so that:

AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 tokens.

Rewrite that as:

AA1<255>BBCCDDEEFFGGHH<255>, where each <255> marker tells the program that the byte pairs between the two markers are sequential, with each token one higher than the previous one. This works well for text files and anywhere there are at least 4 sequential tokens.

This saved 175 bytes in a recent test.
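The dictionary trick above can be sketched like this. The layout is my reading of the description, with assumptions stated plainly: each dictionary entry is a triple (first byte, second byte, token), the value 255 is reserved as the run marker, and no pair byte is ever 255 (reasonable for text files). The names `pack_dictionary` and `unpack_dictionary` are made up for this example.

```python
RUN = 255  # reserved marker; assumes no pair byte equals 255

def pack_dictionary(entries) -> bytes:
    """entries: list of (first, second, token) triples."""
    out = bytearray()
    i = 0
    while i < len(entries):
        # Find how far the tokens keep increasing by exactly one.
        j = i + 1
        while j < len(entries) and entries[j][2] == entries[j - 1][2] + 1:
            j += 1
        out += bytes(entries[i])            # first entry is written in full
        if j - i >= 4:                      # a run only pays off from 4 entries
            out.append(RUN)
            for first, second, _ in entries[i + 1:j]:
                out += bytes([first, second])   # token is implied
            out.append(RUN)
            i = j
        else:
            i += 1
    return bytes(out)

def unpack_dictionary(blob: bytes) -> list:
    entries = []
    i = 0
    while i < len(blob):
        first, second, token = blob[i], blob[i + 1], blob[i + 2]
        entries.append((first, second, token))
        i += 3
        if i < len(blob) and blob[i] == RUN:
            i += 1
            while blob[i] != RUN:
                token += 1
                entries.append((blob[i], blob[i + 1], token))
                i += 2
            i += 1                          # skip the closing marker
    return entries

# The AA1 BB2 ... HH8 example, with tokens stored as the numbers 1..8.
entries = [(ord(c), ord(c), n) for n, c in enumerate("ABCDEFGH", start=1)]
blob = pack_dictionary(entries)
print(len(entries) * 3, len(blob))   # 24 19
print(unpack_dictionary(blob) == entries)   # True
```

A run of n entries costs 2n + 3 bytes instead of 3n, so it breaks even at n = 3 and saves n - 3 bytes from n = 4 onward, which matches the "at least 4 sequential tokens" remark.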
