How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?

南笙 2020-12-05 00:56

I have a (huge) set of similar data files. The set is constantly growing. The size of a single file is about 10K. Each file must be compressed on its own. The compression is

1 answer
  • 2020-12-05 01:45

    John Reiser explained on comp.compression:

    For the dictionary: make a histogram of short substrings, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end, etc.
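    The histogram step described above could be sketched roughly as follows. The function name rankSubstrings, the sample strings, and the payoff formula (occurrences times length minus two) are assumptions made for illustration. A real tool would estimate the bits saved from actual deflate code lengths and would work on bytes rather than JavaScript string characters.

```javascript
// Hypothetical helper: rank substrings of the sample files by estimated payoff.
// Assumption: payoff ~ occurrences * (length - 2), a rough stand-in for
// "number of bits saved"; it is not what zlib itself computes.
function rankSubstrings(samples, k) {
    var counts = new Map();
    samples.forEach(function (text) {
        // Histogram of all substrings of lengths k, k+1, k+2 and k+3
        for (var len = k; len <= k + 3; len++) {
            for (var i = 0; i + len <= text.length; i++) {
                var sub = text.slice(i, i + len);
                counts.set(sub, (counts.get(sub) || 0) + 1);
            }
        }
    });
    var ranked = [];
    counts.forEach(function (count, sub) {
        if (count > 1) // a substring seen only once is not worth storing
            ranked.push({ sub: sub, count: count, payoff: count * (sub.length - 2) });
    });
    // Highest payoff first
    ranked.sort(function (a, b) { return b.payoff - a.payoff; });
    return ranked;
}

// Made-up sample data standing in for the real files
var samples = ['{"id":1,"name":"ann"}', '{"id":2,"name":"bob"}', '{"id":3,"name":"cy"}'];
var top = rankSubstrings(samples, 3).slice(0, 5);
console.log(top);
```

    The highest-payoff entries are then the candidates for the dictionary. How to arrange them is the "art" mentioned in the quote.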

    The Linux kernel uses a similar technique to compress names of symbols that are used for printing backtraces of the subroutine calling stack. See the file scripts/kallsyms.c. For instance, https://code.woboq.org/linux/linux/scripts/kallsyms.c.html

    The zlib manual recommends placing the most common occurrences at the end of the dictionary:

    The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.

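    This is because LZ77 uses a sliding window: the dictionary is treated as data that immediately precedes the input, so the strings at its end are the closest to the data. They stay inside the window the longest as the stream advances, and they are referenced with the shortest distances.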
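    As an illustration of what a preset dictionary buys on short inputs, here is a minimal sketch using the Node.js zlib module, which accepts a dictionary option for deflate and inflate. The dictionary and payload contents are made up for this example. Whether the gain carries over to your files depends on how well the dictionary predicts them.

```javascript
var zlib = require("zlib");

// Made-up dictionary and payload, for illustration only
var dictionary = Buffer.from('"email":"@example.com","status":"active",{"user":"');
var payload = Buffer.from('{"user":"alice","status":"active","email":"alice@example.com"}');

// Compress once without and once with the preset dictionary
var plain = zlib.deflateSync(payload);
var primed = zlib.deflateSync(payload, { dictionary: dictionary });

// The same dictionary must be supplied again when decompressing
var restored = zlib.inflateSync(primed, { dictionary: dictionary });

console.log("without dictionary:", plain.length, "bytes");
console.log("with dictionary:   ", primed.length, "bytes");
console.log("round trip ok:     ", restored.equals(payload));
```

    Comparing the two sizes over a representative batch of files is a simple way to judge whether a candidate dictionary is an improvement.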

    I'd experiment with generating the dictionary in a higher-level language with good string support. A crude JavaScript example:

    var str = "The dictionary should consist of strings (byte sequences) that"
        + " are likely to be encountered later in the data to be compressed,"
        + " with the most commonly used strings preferably put towards the "
        + "end of the dictionary. Using a dictionary is most useful when the"
        + " data to be compressed is short and can be predicted with good"
        + " accuracy; the data can then be compressed better than with the "
        + "default empty dictionary.";
    // Extract words, remove punctuation (extra: replace(/\s/g, " "))
    var words = str.replace(/[,;.:()]/g, "").split(" ").sort();
    var wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
    for (var i = 0; i < words.length; i++) {
        if (words[i] === w) {
            cnt++; // another match
        } else {
            if (w !== "")
                wcnt.push([cnt, w]); // Push a pair (count, word)
            cnt = 1; // Start counting for this word
            w = words[i]; // Remember the word being counted
        }
    }
    if (w !== "")
        wcnt.push([cnt, w]); // Push last word
    // Sort by count, then by word. The default sort() compares elements as
    // strings, which would put a count of 10 before a count of 2.
    wcnt.sort(function (a, b) {
        return a[0] - b[0] || (a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : 0);
    }); // Greater matches at the end
    var dict = wcnt.map(function (pair) {
        return pair[1]; // Just take the words
    }).join("").slice(-70); // Join the words, take last 70 chars
    

    Then dict is a string of 70 chars with:

    rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe
    

    To try it, paste the code into any JavaScript console and add console.log(dict) at the end.

    This example uses only whole words, not substrings. There are also ways to overlap common substrings to save space in the dictionary.
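    One simple way to get such overlaps is a greedy merge: skip a word that is already contained in the dictionary, and otherwise reuse the longest match between the end of the dictionary and the start of the word. The helper name appendWithOverlap and the word list are hypothetical. This is only a sketch and it does not search for the shortest possible packing.

```javascript
// Hypothetical helper: append `word` to `dict`, reusing any overlap between
// the end of `dict` and the start of `word`.
function appendWithOverlap(dict, word) {
    if (dict.indexOf(word) !== -1)
        return dict; // already present as a substring, nothing to add
    var max = Math.min(dict.length, word.length);
    for (var n = max; n > 0; n--) {
        // Longest suffix of dict that is also a prefix of word
        if (dict.slice(-n) === word.slice(0, n))
            return dict + word.slice(n);
    }
    return dict + word; // no overlap at all
}

// Made-up word list, least common first so the most common ends up last
var packed = ["most", "strings", "string", "sliding"].reduce(function (dict, word) {
    return appendWithOverlap(dict, word);
}, "");
console.log(packed, packed.length);
```

    Here the four words total 24 characters but pack into 15. Note that merging in frequency order trades some packing efficiency for keeping the most common strings near the end.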
