Question
I want to store web pages in compressed text files (CSV). To achieve optimal compression, I would like to provide a set of 1000 web pages. The library should then spend some time creating the optimal "dictionary" for this content. One obvious "dictionary" entry could be <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">, which could be stored as %1 or something like that, because it is present on almost all web pages. With a customized dictionary like this, the compression rate should be 99% in my case.
My question is: does a library for doing this exist on Windows under the MIT license or similarly liberal licensing? If not, are there any general-purpose compression libraries you would recommend? I have tried zlib a bit, but it outputs binary data. If I converted this binary data into text, I am worried that the result might be longer than the original text.
EDIT: I need to be able to store the text in CSV files and still be able to import them into a database or even Excel.
Answer 1:
"text files (not binary)" is a little too general. If you mean that some byte values (00, 1A or whatever) can't be used, then any binary method plus something like base64 encoding can be used. (Although I'd suggest a more efficient method from Coroutine demo source).
To be specific, you can use any general-purpose compressor to compress your base file, then base file + target file, then diff these, and you'd get a dictionary compression (binary), which can then be converted to "text" with base64 or yEnc or whatever.
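A related shortcut, not the diff trick described above: zlib itself accepts a preset dictionary, so the "binary method + base64" idea can be tried directly. Below is a minimal Python sketch under stated assumptions. `SHARED_DICT` and the two helper names are hypothetical, and a real dictionary would be assembled from the sample pages rather than written by hand. Base64 grows the compressed bytes by roughly a third, so the text output is only shorter than the original when compression saves more than that.

```python
import base64
import zlib

# Hypothetical shared dictionary: byte strings expected in most pages.
# zlib's documentation suggests placing the most common strings towards the end.
SHARED_DICT = (
    b'<html><head><title></title></head><body></body></html>'
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    b'"http://www.w3.org/TR/html4/strict.dtd">'
)


def compress_to_text(page: str, zdict: bytes = SHARED_DICT) -> str:
    """Deflate the page against the preset dictionary, then base64-encode it."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    raw = comp.compress(page.encode("utf-8")) + comp.flush()
    # The base64 alphabet has no commas, quotes or newlines, so the result
    # can sit in a CSV field without extra escaping.
    return base64.b64encode(raw).decode("ascii")


def decompress_from_text(text: str, zdict: bytes = SHARED_DICT) -> str:
    """Inverse of compress_to_text; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    raw = decomp.decompress(base64.b64decode(text)) + decomp.flush()
    return raw.decode("utf-8")


if __name__ == "__main__":
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">'
        '<html><head><title>Hi</title></head><body>Hello</body></html>'
    )
    packed = compress_to_text(page)
    print(len(page), len(packed))
    print(decompress_from_text(packed) == page)
```

The dictionary has to be stored once, outside the CSV, and must be identical at compression and decompression time.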
Alternatively, there are some coders with built-in support for that, for example:
http://compression.ru/ds/ppmtrain.rar
http://code.google.com/p/lzham/
If you actually want to have common phrases replaced with references, and everything else left untouched (which is kind of implied, but is not the same as "text output"), you can use text preprocessors like:
http://xwrt.sourceforge.net/
http://compression.ru/ds/liptify.rar
(There were more, AFAIR.)
Also, a hybrid method is possible. You can use a general-purpose LZ compressor like in [1], for example lzma, then replace its entropy coding with something text-based. For example, in http://nishi.dreamhosters.com/u/lzmarec_v1_bin.rar there's a utility which removes LZMA's entropy coding, and it's pretty easy to convert its output to text.
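To make the phrase-replacement idea from this answer concrete (common phrases replaced with references, everything else left untouched), here is a toy Python sketch. It does not reproduce how XWRT or liptify work. The function name `build_codec`, the `%N;` token format, and the phrase list are all assumptions for illustration, and phrases are assumed to be non-empty and not equal to a lone `%`.

```python
import re


def build_codec(phrases):
    """Return (encode, decode) functions for a fixed phrase list.

    Phrases become tokens of the form %N; and a literal '%' becomes '%%',
    so the encoded text can be decoded without ambiguity.
    """
    # Longest first, so a long phrase wins over a shorter one at the same spot.
    ordered = sorted(phrases, key=len, reverse=True)
    index = {phrase: i for i, phrase in enumerate(ordered)}
    alternatives = [re.escape(p) for p in ordered] + ["%"]
    enc_re = re.compile("|".join(alternatives))
    dec_re = re.compile(r"%(%|\d+;)")

    def encode(text):
        def repl(match):
            found = match.group()
            return "%%" if found == "%" else "%{};".format(index[found])
        return enc_re.sub(repl, text)

    def decode(text):
        def repl(match):
            token = match.group(1)
            return "%" if token == "%" else ordered[int(token[:-1])]
        return dec_re.sub(repl, text)

    return encode, decode


if __name__ == "__main__":
    # Hypothetical phrase list; in practice it would be mined from sample pages.
    encode, decode = build_codec(["<!DOCTYPE html>", "<html>", "</html>"])
    page = "<!DOCTYPE html><html>50% off</html>"
    packed = encode(page)
    print(packed)
    print(decode(packed) == page)
```

The output is still ordinary text, so commas, quotes and newlines in the page need the usual CSV quoting (Python's `csv` module handles that on write and read).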
Source: https://stackoverflow.com/questions/5220122/library-to-compress-text-data-and-store-it-as-text