This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:
I currently use gzip compression for some of my data, but I need random access inside the compressed files.
The gzip format can be randomly accessed provided an index has been previously created, as demonstrated in zlib's zran.c example source code.
I've developed a command-line tool built upon zlib's zran.c which creates indexes for gzip files: https://github.com/circulosmeos/gztool
It can even create an index for a still-growing gzip file (for example, a log created by rsyslog directly in gzip format), which in practice reduces the index-creation time to zero. See the -S (Supervise) option.
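For context, the trick that zran.c-style indexes rely on is to record, every so often, the compressed and uncompressed offsets together with the previous 32 KiB of output, and later resume raw inflation from that point. Below is a minimal sketch of the resume step; the access_point struct is my own invention for illustration (the real index layouts in zran.c and gztool differ), and most error handling is omitted:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <zlib.h>

#define WINSIZE 32768U              /* deflate window size */

/* Hypothetical access point saved while building the index. */
struct access_point {
    off_t out;                      /* uncompressed offset of this point        */
    off_t in;                       /* compressed offset of the first full byte */
    int bits;                       /* bits of the previous byte still needed   */
    unsigned char window[WINSIZE];  /* last 32 KiB of uncompressed output       */
};

/* Read up to `len` bytes of uncompressed data starting at absolute
   uncompressed offset `target` (target >= pt->out), resuming from `pt`.
   Returns the number of bytes read, or -1 on error. */
ssize_t extract_from(FILE *f, const struct access_point *pt,
                     off_t target, unsigned char *buf, size_t len)
{
    z_stream strm;
    memset(&strm, 0, sizeof strm);
    if (inflateInit2(&strm, -15) != Z_OK)       /* -15: raw deflate, no header */
        return -1;

    /* Position just before the access point; re-feed leftover bits if any. */
    fseeko(f, pt->in - (pt->bits ? 1 : 0), SEEK_SET);
    if (pt->bits) {
        int ch = getc(f);
        inflatePrime(&strm, pt->bits, ch >> (8 - pt->bits));
    }
    inflateSetDictionary(&strm, pt->window, WINSIZE);

    off_t pos = pt->out;                        /* current uncompressed offset */
    size_t got = 0;
    unsigned char in[16384], skip[WINSIZE];
    int ret = Z_OK;

    while (ret != Z_STREAM_END && got < len) {
        if (strm.avail_in == 0) {               /* refill the input buffer */
            strm.avail_in = fread(in, 1, sizeof in, f);
            strm.next_in = in;
            if (strm.avail_in == 0)
                break;                          /* premature end of file */
        }
        if (pos < target) {                     /* still skipping ahead */
            off_t want = target - pos;
            strm.avail_out = want < (off_t)sizeof skip ? (uInt)want : sizeof skip;
            strm.next_out = skip;
        } else {                                /* delivering to the caller */
            strm.avail_out = (uInt)(len - got); /* assumes len fits in uInt */
            strm.next_out = buf + got;
        }
        uInt before = strm.avail_out;
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            inflateEnd(&strm);
            return -1;
        }
        if (pos < target)
            pos += before - strm.avail_out;
        else
            got += before - strm.avail_out;
    }
    inflateEnd(&strm);
    return (ssize_t)got;
}
```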
Solutions exist for providing random access to gzip and bzip2 archives, but I'm looking for something for 7zip.
I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.
Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either the bzip2 or gzip algorithm. The offsets, metadata and compressed genomic data are concatenated into one file.
Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.
For your case, you could store (10 MB, or whatever) offsets in a header of a custom archive format. You parse the header, retrieve the offsets, and incrementally fseek through the file by current_offset_sum + header_size.
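A hedged sketch of that idea, assuming a made-up header layout of little bookkeeping only: a uint64 block count followed by count+1 uint64 offsets (so each block's length is the difference between neighbours), with offsets relative to the end of the header. Endianness and most error handling are glossed over:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical layout: [uint64 block_count][block_count+1 uint64 offsets][blocks...]
   offsets[i] is relative to the end of the header, so the absolute position of
   block i is header_size + offsets[i], and block i occupies
   [offsets[i], offsets[i+1]). */
int seek_to_block(FILE *f, uint64_t block_index, uint64_t *block_size)
{
    uint64_t count;
    if (fread(&count, sizeof count, 1, f) != 1 || block_index >= count)
        return -1;

    uint64_t *offsets = malloc((count + 1) * sizeof *offsets);
    if (!offsets || fread(offsets, sizeof *offsets, count + 1, f) != count + 1) {
        free(offsets);
        return -1;
    }

    uint64_t header_size = sizeof count + (count + 1) * sizeof *offsets;
    if (fseeko(f, (off_t)(header_size + offsets[block_index]), SEEK_SET) != 0) {
        free(offsets);
        return -1;
    }
    *block_size = offsets[block_index + 1] - offsets[block_index];
    free(offsets);
    return 0;   /* file position is now at the start of the requested block */
}
```

From there you hand exactly *block_size bytes of that one block to whichever decompressor (gzip, bzip2, ...) you used to write it.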
Because lossless compression works better on some areas than others, if you store compressed data in blocks of some convenient length BLOCKSIZE, then even though each block has exactly the same number of compressed bytes, some compressed blocks will expand to a much longer piece of plaintext than others.
You might look at "Compression: A Key for Next-Generation Text Retrieval Systems" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates in Computer magazine November 2000 http://doi.ieeecomputersociety.org/10.1109/2.881693
Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a vocabulary list) into a whole word. One can directly search the compressed text for words or phrases, which turns out to be even faster than searching uncompressed text.
Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.
You can give every word a unique 2 byte code, since you probably have fewer than 65,000 unique words in your text. (There are almost 13,000 unique words in the KJV Bible). Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases". (The compression gained by packing frequent words and phrases into two bytes is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter).

There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression. For example, you could tweak an LZW compressor to dump "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data. Or you could arbitrarily chop up your uncompressed data into 5 byte phrases in a lexicon file, one line per phrase. Or you could chop up your uncompressed data into actual English words, and put each word -- including the space at the beginning of the word -- into the lexicon file. Then use "sort --unique" to eliminate duplicate words in that lexicon file. (Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)
Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two byte "words" -- from there to the end of the file. Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two byte code" to "variable-length phrase". My first draft would start with a simple one line per phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.
You can pick any random even byte offset into the compressed text, and start decompressing from there. I don't think it's possible to make a finer-grained random access compressed file format.
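To make the random-access claim concrete, here is a hedged sketch of the decode side, assuming the lexicon has already been loaded into an array of strings and that codes 0-255 are the literal-byte escapes described above (all names and the big-endian code layout are illustrative choices of mine, not taken from any particular tool):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode `nwords` two-byte codes starting at an arbitrary *even* byte offset
   into the compressed text.  lexicon[256..lexicon_len-1] hold words/phrases;
   codes 0..255 spell out one literal byte each, so entries 0..255 are unused. */
void decode_from(const uint8_t *compressed, size_t even_offset, size_t nwords,
                 char *const *lexicon, size_t lexicon_len, FILE *out)
{
    const uint8_t *p = compressed + even_offset;
    for (size_t i = 0; i < nwords; i++, p += 2) {
        uint16_t code = (uint16_t)(p[0] << 8 | p[1]);   /* big-endian code */
        if (code < 256)
            fputc(code, out);                 /* literal byte "escape" */
        else if (code < lexicon_len)
            fputs(lexicon[code], out);        /* whole word or phrase */
        else
            break;                            /* corrupt input, in this sketch */
    }
}
```

Searching works the same way: encode the query words with the same lexicon and scan the code stream two bytes at a time.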
The .xz file format (which uses LZMA compression) seems to support this:
Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.
This should be sufficient for your purpose. A drawback is that the API of liblzma (for interacting with these containers) does not seem that well-documented, so it may take some effort figuring out how to randomly access blocks.
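For what it's worth, the relevant pieces are liblzma's lzma_index / lzma_index_iter APIs. The sketch below assumes you have already decoded the stream's index into an lzma_index (e.g. with lzma_index_decoder(), which is the fiddly part this fragment omits):

```c
#include <stdio.h>
#include <lzma.h>

/* Given an already-decoded lzma_index, find the block that contains the
   uncompressed offset `target` and report where it lives in the .xz file.
   The caller would then seek there, decode the block header with
   lzma_block_header_decode() and the block itself with lzma_block_decoder(). */
int locate_block(const lzma_index *idx, lzma_vli target)
{
    lzma_index_iter iter;
    lzma_index_iter_init(&iter, idx);

    /* lzma_index_iter_locate() returns true when the target is past the end. */
    if (lzma_index_iter_locate(&iter, target))
        return -1;

    printf("block starts at compressed offset %llu, "
           "covers uncompressed offsets %llu..%llu\n",
           (unsigned long long)iter.block.compressed_file_offset,
           (unsigned long long)iter.block.uncompressed_file_offset,
           (unsigned long long)(iter.block.uncompressed_file_offset
                                + iter.block.uncompressed_size - 1));
    return 0;
}
```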
bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.
See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.
I don't know to what extent it is adaptable to other applications.
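For a rough idea of why BGZF is seekable: the file is a series of small, independently decompressible gzip blocks (each holding less than 64 KiB of uncompressed data, with the block's compressed size recorded in a gzip extra subfield), and tabix-style indexes address data with 64-bit "virtual offsets" that pack the block's position in the compressed file together with an offset inside its uncompressed contents. A minimal sketch of that packing convention:

```c
#include <stdint.h>

/* BGZF virtual file offset: upper 48 bits = offset of the gzip block in the
   compressed file, lower 16 bits = offset within that block's uncompressed
   contents (each block holds < 64 KiB of uncompressed data). */
static inline uint64_t bgzf_voffset(uint64_t compressed_block_offset,
                                    uint16_t within_block_offset)
{
    return (compressed_block_offset << 16) | within_block_offset;
}

static inline void bgzf_split(uint64_t voffset,
                              uint64_t *compressed_block_offset,
                              uint16_t *within_block_offset)
{
    *compressed_block_offset = voffset >> 16;
    *within_block_offset = (uint16_t)(voffset & 0xFFFF);
}
```

To seek, you jump to the block's compressed offset, inflate just that one block, and skip the within-block offset in its output.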