I've been doing some research on compression-based text classification and I'm trying to figure out a way of storing a dictionary built by the encoder (on a training file).
Deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source of potential matches for the string of bytes starting at the current position. Those previous 32K bytes are called the "dictionary", but the name is perhaps misleading.
You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
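If you want to try this quickly without writing C, Python's zlib module exposes the same mechanism through the zdict argument of zlib.compressobj() and zlib.decompressobj(). The following is only a minimal sketch: the helper names compressed_size and roundtrip and the sample strings are made up for illustration, and it assumes your training data can be boiled down to at most 32K bytes of representative text.

```python
import zlib


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Return the size of data after zlib compression.

    If zdict is non-empty, the compressor is primed with it as a preset
    dictionary (the Python counterpart of deflateSetDictionary()).
    """
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def roundtrip(data: bytes, zdict: bytes) -> bytes:
    """Compress and decompress data using the same preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    packed = comp.compress(data) + comp.flush()
    # The decompressor must be given the identical dictionary
    # (the counterpart of inflateSetDictionary()).
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(packed) + decomp.flush()


# Hypothetical "training" text and a short sample to classify.
training = b"the quick brown fox jumps over the lazy dog. " * 20
sample = b"the quick brown fox jumps over the lazy dog."

print("without dictionary:", compressed_size(sample))
print("with dictionary:   ", compressed_size(sample, training))
```

For classification, one possible approach is to prime a compressor with a dictionary drawn from each class and pick the class that yields the smallest output. Keep in mind that only the first 32K bytes of the input can benefit, so this suits short samples best.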
gzip provides no support for preset dictionaries.