Question
According to this FAQ on zlib.net it is possible to:
access data randomly in a compressed stream
I know about the Bio.bgzf module in Biopython 1.60, which:
supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.
But for my use case I don't want to use that format. Basically, I want something that emulates the code below:
import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz', 'rt') as f:
    f.seek(large_integer_new_line_start)
but with the efficient random access to the compressed stream that the zlib.net FAQ describes. How do I leverage that random access capability in Python?
Answer 1:
I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
Then I used Biopython's tell() to get the virtual offset reached after reading 1 million lines of the bgzipped file. After that, I was able to seek to that virtual offset rapidly:
from Bio import bgzf
file = 'large_file.bgz'
# First pass: read 1 million lines, then record the virtual offset
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()
# Second pass: jump straight to the recorded virtual offset
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()
assert line1 == line2
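The virtual offset used above is a single integer that combines two numbers: the file offset where a BGZF block starts, and the position inside that block's decompressed data. A minimal sketch of that packing follows. The two helpers mirror the functions of the same name in Bio.bgzf, but they are reimplemented here for illustration, and the range checks are an assumption of this sketch.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block's start offset and an offset inside the block into one integer."""
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start offset must fit in 48 bits")
    # Upper 48 bits: where the block starts in the compressed file.
    # Lower 16 bits: position within the block's decompressed data.
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Because each block is compressed independently, a reader only has to seek to block_start, decompress that one block, and skip within_block bytes.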
I would also like to point to Mark Adler's SO answer on examples/zran.c in the zlib distribution.
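The idea behind an index over a plain gzip file can be approximated in pure Python. This is a rough sketch, not a port of zran.c: instead of storing 32 KB windows at deflate block boundaries, it keeps in-memory snapshots of a zlib decompressor made with Decompress.copy(). The class name CheckpointedGzip, its parameters, and the assumption of a single-member gzip file are all choices made for this illustration.

```python
import bisect
import zlib


class CheckpointedGzip:
    """Random reads from a single-member gzip file via decompressor snapshots."""

    def __init__(self, path, span=1 << 20, chunk=1 << 16):
        self.path = path
        self.chunk = chunk
        self.starts = []       # uncompressed offset of each checkpoint
        self.checkpoints = []  # (compressed offset, decompressor snapshot)
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect gzip framing
        self._save(0, 0, d)
        out_pos = in_pos = 0
        with open(path, "rb") as f:
            while not d.eof:
                buf = f.read(chunk)
                if not buf:
                    break
                # Without max_length, decompress() consumes the whole chunk,
                # so the state afterwards corresponds to exactly in_pos bytes.
                out_pos += len(d.decompress(buf))
                in_pos += len(buf)
                if not d.eof and out_pos - self.starts[-1] >= span:
                    self._save(out_pos, in_pos, d)
        self.size = out_pos  # total uncompressed size

    def _save(self, out_pos, in_pos, d):
        self.starts.append(out_pos)
        self.checkpoints.append((in_pos, d.copy()))

    def read_at(self, offset, n):
        """Return up to n uncompressed bytes starting at uncompressed offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        if offset >= self.size or n <= 0:
            return b""
        # Last checkpoint at or before the requested offset.
        i = bisect.bisect_right(self.starts, offset) - 1
        in_pos, snapshot = self.checkpoints[i]
        d = snapshot.copy()  # leave the stored snapshot untouched
        skip = offset - self.starts[i]
        out = bytearray()
        with open(self.path, "rb") as f:
            f.seek(in_pos)
            while len(out) < skip + n and not d.eof:
                buf = f.read(self.chunk)
                if not buf:
                    break
                out += d.decompress(buf)
        return bytes(out[skip:skip + n])
```

Building the index still costs one full pass over the file, and each snapshot holds a decompressor state in memory, so span trades memory for seek latency. Unlike a zran.c index, these snapshots cannot be written to disk.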
Answer 2:
You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a randomly seekable, backward-compatible variant of gzip compression.
Answer 3:
The indexed_gzip package might be what you want. It also uses zran.c under the hood.
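A hedged sketch of how this might look in practice follows. The helper names open_seekable_gzip and read_at are hypothetical. The IndexedGzipFile call follows the usage shown in the indexed_gzip README and assumes the package is installed. If it is not, the sketch falls back to the standard gzip module, which also supports seek() but emulates it by decompressing from the start of the file.

```python
import gzip


def open_seekable_gzip(path):
    """Return a binary file object for path that supports seek() and read()."""
    try:
        import indexed_gzip  # third-party package, assumed to be installed
    except ImportError:
        # Fallback: correct but slow, since seeking re-reads from the start.
        return gzip.open(path, "rb")
    return indexed_gzip.IndexedGzipFile(path)


def read_at(path, offset, n):
    """Read n uncompressed bytes starting at uncompressed offset."""
    f = open_seekable_gzip(path)
    try:
        f.seek(offset)
        return f.read(n)
    finally:
        f.close()
```

For example, read_at('large_file.gz', 10**9, 100) would return 100 bytes starting one billion bytes into the uncompressed data, which matches what the question's gzip.open snippet is after.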
Answer 4:
If you just want to access the file from a random point, can't you just do:
from random import randint
with open(filename, 'rb') as f:
    f.seek(0, 2)  # jump to the end to measure the file size
    size = f.tell()
    f.seek(randint(0, size))  # absolute offset from the start of the file
Source: https://stackoverflow.com/questions/22950030/how-to-obtain-random-access-of-a-gzip-compressed-file