random seek in 7z single file archive


It's technically possible, but if your question is "does the currently available 7zip command-line binary allow that?", the answer is unfortunately no. The best it offers is to compress each file in the archive independently, which allows individual files to be retrieved directly. But since what you want to compress is a single (huge) file, this trick will not work.

I'm afraid the only way is to chunk your file into small blocks and feed them to an LZMA encoder (included in the LZMA SDK). Unfortunately, that requires some programming skills.

Note: a technically inferior but trivial compression scheme can be found here. Its main program does just what you are looking for: it cuts the source file into small blocks and feeds them one by one to a compressor (in this case, LZ4). The decoder then performs the reverse operation, and can easily skip all the compressed blocks to go straight to the one you want to retrieve. http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c
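
Here is a minimal sketch of that block-cutting idea in Python, using the standard lzma module in place of the raw LZMA SDK; the block size and the length-prefix framing are assumptions for illustration, not an established format:

import lzma
import struct

BLOCK_SIZE = 1 << 20  # 1 MiB of uncompressed data per block (arbitrary choice)

def compress_chunked(src_path, dst_path):
    """Cut the source into fixed-size blocks, compress each block
    independently, and prefix each with its compressed length."""
    offsets = []  # file offset of every compressed block
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(BLOCK_SIZE)
            if not chunk:
                break
            comp = lzma.compress(chunk)
            offsets.append(dst.tell())
            dst.write(struct.pack("<I", len(comp)))
            dst.write(comp)
    return offsets  # persist this table somewhere; it is the seek index

def read_block(dst_path, offsets, block_no):
    """Random access: seek straight to one block and decompress only it."""
    with open(dst_path, "rb") as f:
        f.seek(offsets[block_no])
        (comp_len,) = struct.unpack("<I", f.read(4))
        return lzma.decompress(f.read(comp_len))

Byte i of the original file then lives in block i // BLOCK_SIZE, so a seek costs one block decompression instead of decompressing everything that precedes it.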

How about this:

Concept: because you are basically reading only one file, index the .7z by block.

Read the compressed file block by block, give each block a number and possibly an offset into the large file. Scan for target item anchors in the data stream (e.g. Wikipedia article titles). For each anchor, record the block number where the item began (which may be the block before).

Write the index to some kind of O(log n) store. For an access, retrieve the block number and its offset, extract the block, and find the item. The cost is bounded by the extraction of one block (or very few) and the string search within that block.

For this you have to read through the file once, but you can stream the data and discard it after processing, so nothing hits the disk.
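
A rough sketch of that one-pass index build, reusing read_block from the sketch above; the <title> anchor and the in-memory dict are assumptions for illustration (the O(log n) store could be any sorted or on-disk structure):

ANCHOR = b"<title>"

def build_index(dst_path, offsets):
    """One streaming pass: decompress each block, scan it for anchors,
    and map every title to the block where it was first seen. An item
    near the start of a block may actually begin in the previous one,
    so keep a one-block overlap when extracting."""
    index = {}  # title -> block number
    for block_no in range(len(offsets)):
        data = read_block(dst_path, offsets, block_no)
        pos = data.find(ANCHOR)
        while pos != -1:
            end = data.find(b"</title>", pos)
            if end == -1:
                break  # anchor truncated by the block boundary
            title = data[pos + len(ANCHOR):end].decode("utf-8", "replace")
            index.setdefault(title, block_no)
            pos = data.find(ANCHOR, end)
    return index

A lookup is then index[title] followed by read_block, i.e. one block extraction plus a string search, exactly the cost bound described above.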

DARN: you basically postulated this in your question... it seems advantageous to read the question before answering...

The 7z archiver says that this file has a single block of data, compressed with the LZMA algorithm.

What is the 7z / xz command to find out whether it is a single compressed block or not? Will 7z create a multiblock (multistream) archive when used with several threads?

The original file is huge (999 GB of XML).
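
For what it's worth, two ways to inspect this (treat them as pointers rather than a recipe, since the exact output differs between versions):

xz --list dump.xml.xz    # prints the number of streams and blocks in the .xz container
7z l -slt dump.xml.7z    # per-file technical listing; block-related fields show how the data is grouped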

The good news: Wikipedia switched to multistream archives for its dumps (at least for enwiki): http://dumps.wikimedia.org/enwiki/

For example, the most recent dump, http://dumps.wikimedia.org/enwiki/20140502/, has multistream bzip2 (with a separate index of the form "offset:export_article_id:article_name"), and the 7z dump is stored in many sub-GB archives with ~3k (?) articles per archive:

Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

enwiki-20140502-pages-articles-multistream.xml.bz2 10.8 GB
enwiki-20140502-pages-articles-multistream-index.txt.bz2 150.3 MB

All pages with complete edit history (.7z)

enwiki-20140502-pages-meta-history1.xml-p000000010p000003263.7z 213.3 MB
enwiki-20140502-pages-meta-history1.xml-p000003264p000005405.7z 194.5 MB
enwiki-20140502-pages-meta-history1.xml-p000005406p000008209.7z 216.1 MB
enwiki-20140502-pages-meta-history1.xml-p000008210p000010000.7z 158.3 MB
enwiki-20140502-pages-meta-history2.xml-p000010001p000012717.7z 211.7 MB
 .....
enwiki-20140502-pages-meta-history27.xml-p041211418p042648840.7z 808.6 MB

I think we can use the bzip2 index to estimate the article id even for the 7z dumps, and then we just need the 7z archive covering the right range (..p first_id p last_id .7z). stub-meta-history.xml may help too.
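
For the bzip2 multistream dump itself, random access is straightforward. A minimal sketch with Python's bz2 module, assuming the "offset:export_article_id:article_name" index layout described above (the function name is illustrative):

import bz2

def read_stream(dump_path, offset, next_offset):
    """Each multistream chunk is a complete bz2 stream: seek to its
    offset (taken from the index file) and decompress just that chunk."""
    with open(dump_path, "rb") as f:
        f.seek(offset)
        data = f.read(next_offset - offset)
    return bz2.BZ2Decompressor().decompress(data)

# Index lines look like offset:export_article_id:article_name, so:
#   offset, article_id, title = line.split(":", 2)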

FAQ for dumps: http://meta.wikimedia.org/wiki/Data_dumps/FAQ

Simply use:

7z e myfile_xml.7z -so | sed [something] 

Example: get line 7:

7z e myfile_xml.7z -so | sed -n 7p
