I am ordering a huge pile of Landsat scenes from the USGS, which come as tar.gz archives. I am writing a simple Python script to unpack them. Each archive contains 15 tiff imag
The problem is that a tar file does not have a central file list, but stores files sequentially with a header before each file. The tar file is then compressed via gzip to give you tar.gz. With a tar file, if you don't want to extract a certain file, you simply skip the next header->size bytes in the archive and then read the next header. If the archive is additionally compressed, you still have to skip that many bytes, only not within the archive file but within the decompressed data stream, which works for some compression formats but for others requires you to decompress everything in between.
gzip belongs to the latter class of compression schemes. So while you save some time by not writing the undesired files to disk, your code still decompresses them. You might be able to overcome that problem by overriding the _Stream class for non-gzip archives, but for your gz files, there is nothing you can do about it.
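To make the header-then-skip layout concrete, here is a minimal sketch that walks an uncompressed tar by hand. It assumes the plain ustar layout (512-byte blocks, name at offset 0, octal size at offset 124) and ignores long-name and pax extensions; the helper names build_demo_tar and list_members_by_skipping are made up for this illustration and are not part of the tarfile module.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data padding both use 512-byte blocks


def build_demo_tar():
    """Build a small uncompressed ustar archive in memory (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, payload in [("a.txt", b"x" * 10),
                              ("b.txt", b"y" * 1000),
                              ("c.txt", b"z" * 3)]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def list_members_by_skipping(fobj):
    """Read each header, then seek past the file data without reading it."""
    members = []
    while True:
        header = fobj.read(BLOCK)
        # An all-zero block (or a short read) marks the end of the archive.
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break
        name = header[0:100].rstrip(b"\0").decode("ascii")
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
        members.append((name, size))
        # File data is padded up to a whole number of blocks.
        padded = (size + BLOCK - 1) // BLOCK * BLOCK
        fobj.seek(padded, io.SEEK_CUR)
    return members


members = list_members_by_skipping(build_demo_tar())
print(members)
```

The seek call is cheap here because the data is uncompressed. On a gzip stream the same skip can only be done by decompressing the bytes in between, which is the cost described above.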
You can do that more efficiently by opening the tarfile as a stream (https://docs.python.org/2/library/tarfile.html#tarfile.open).
mkdir tartest
cd tartest/
dd if=/dev/urandom of=file1 count=100 bs=1M
dd if=/dev/urandom of=file2 count=100 bs=1M
dd if=/dev/urandom of=file3 count=100 bs=1M
dd if=/dev/urandom of=file4 count=100 bs=1M
dd if=/dev/urandom of=file5 count=100 bs=1M
cd ..
tar czvf test.tgz tartest
Now read like this:
import tarfile
fileName = "test.tgz"
tfile = tarfile.open(fileName, 'r|gz')
for t in tfile:
    if "file3" in t.name:
        f = tfile.extractfile(t)
        if f:
            print(len(f.read()))
Note the | in the open command. We only read file3.
$ time python test.py
104857600
real 0m1.201s
user 0m0.820s
sys 0m0.377s
If I change the r|gz back to r:gz I get:
$ time python test.py
104857600
real 0m7.033s
user 0m6.293s
sys 0m0.730s
Roughly 5 times faster (since we have 5 equally sized files). This is because the standard way of opening allows seeking backwards, and in a compressed tarfile it can only do so by extracting (I do not know the exact reason for that). If you open the archive as a stream, you can no longer seek randomly, but if you read sequentially, which is possible in your case, it is much faster. However, you cannot call getnames beforehand anymore. But that is not necessary in this case.
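Applied to the original problem of unpacking only some files from each archive, the streaming approach could look roughly like the sketch below. The function name extract_matching and the example suffixes are assumptions made for illustration, since the question does not say which files are wanted. Members are selected during the single pass, so no getnames call is needed.

```python
import os
import shutil
import tarfile


def extract_matching(archive_path, dest_dir, wanted_suffixes):
    """Stream through a tar.gz once and write only the matching members.

    wanted_suffixes is a hypothetical selection rule, e.g. ["_B4.TIF"].
    """
    os.makedirs(dest_dir, exist_ok=True)
    suffixes = tuple(wanted_suffixes)
    extracted = []
    # 'r|gz' opens the archive as a forward-only stream.
    with tarfile.open(archive_path, "r|gz") as tf:
        for member in tf:
            if not member.isfile() or not member.name.endswith(suffixes):
                continue  # skipped members are never written to disk
            src = tf.extractfile(member)
            # Keep only the base name so a member path cannot leave dest_dir.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            with open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Each member has to be copied while the loop is positioned on it, because a stream cannot go back to an earlier member. Skipped members are still decompressed, as explained in the first answer, but they are not written to disk.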