What is the current state of text-only compression algorithms?

猫巷女王i 2021-01-30 13:52

In honor of the Hutter Prize, what are the top algorithms (and a quick description of each) for text compression?

Note: The intent of this question is to get a descript

3 Answers
  •  遥遥无期
    2021-01-30 14:43

    The boundary-pushing compressors combine several algorithms to reach extreme compression ratios. Common algorithms include:

    • The Burrows-Wheeler Transform (BWT) - permute the characters (or other blocks) with a deterministic, reversible algorithm so that identical characters cluster into runs, which makes the source easier to compress. Decompression proceeds as normal, and the result is then un-shuffled with the inverse transform. Note: BWT alone doesn't actually compress anything. It just makes the source easier to compress.
    • Prediction by Partial Matching (PPM) - a statistical modeling technique where the prediction model (context) is built adaptively by gathering statistics about the source as it is read, instead of using static probabilities. Although PPM is usually paired with an arithmetic coder, its predictions can also drive other entropy coders such as Huffman coding.
    • Context Mixing - a static model uses fixed probabilities for prediction, PPM dynamically chooses a single context, and context mixing uses many contexts and weights their predictions. PAQ uses context mixing.
    • Dynamic Markov Compression (DMC) - related to PPM, but predicts one bit at a time instead of a byte or longer symbol.
    • In addition, Hutter Prize contestants may replace common words with short codes from external dictionaries, and mark upper- and lowercase text with a special symbol instead of using two distinct dictionary entries. That's why they're so good at compressing text (especially ASCII text) and less valuable for general compression.
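
To make the BWT bullet above a bit more concrete, here is a minimal Python sketch. It assumes the naive "sort all rotations" approach and a NUL character as an end-of-string sentinel; both are illustration choices, not what production compressors do (they use much faster suffix sorting). The names `bwt` and `inverse_bwt` are hypothetical.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Build every rotation of s and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform output is the last character of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    encoded = bwt("banana")
    print(repr(encoded))
    print(inverse_bwt(encoded))
```

The output has exactly the same length as the input plus the sentinel, so nothing is compressed yet. The gain comes from like characters ending up next to each other, which a later run-length or entropy coding stage can exploit.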

    Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
