Question
I have several big HDF5 files stored on an SSD (the LZF-compressed file size is 10–15 GB; uncompressed it would be 20–25 GB). Reading the contents of such a file into RAM for further processing takes roughly 2 minutes per file. During that time only one core is utilized (but at 100%), so I guess the decompression running on the CPU is the bottleneck, not the IO throughput of the SSD.
At the start of my program it reads multiple files of that kind into RAM, which takes quite some time. I would like to speed up that process by utilizing more cores and possibly more RAM, until the SSD's IO throughput becomes the limiting factor. The machine I'm working on has plenty of resources (20 CPU cores [+ 20 HT] and 400 GB RAM), and »wasting« RAM is no big deal, as long as it is justified by saving time.
I came up with two ideas on my own:
1) Use Python's multiprocessing module to read several files into RAM in parallel. This works in principle, but because multiprocessing serializes results with pickle (as stated here), I hit the 4 GiB serialization limit: OverflowError('cannot serialize a bytes object larger than 4 GiB'). (A minimal sketch of this approach is shown after the list below.)
2) Have several processes (using a Pool from the multiprocessing module) open the same HDF5 file (using with h5py.File('foo.h5', 'r') as h_file:), read an individual chunk from it (chunk = h_file['label'][i : i + chunk_size]) and return that chunk. The gathered chunks are then concatenated. However, this fails with an OSError: Can't read data (data error detected by Fletcher32 checksum). Is this due to the fact that I open the very same file within multiple processes (as suggested here)?
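For illustration, here is a minimal sketch of the first approach (the file names and the dataset label are placeholders I made up). Each worker returns a whole decompressed array, and pickling that return value on its way back to the parent process is what triggers the OverflowError:

    from multiprocessing import Pool

    import h5py


    def load_whole_file(filename, label='label'):
        # Each worker reads one complete dataset into memory; the result is
        # pickled when it is sent back to the parent, which fails above 4 GiB.
        with h5py.File(filename, 'r') as h_file:
            return h_file[label][:]


    if __name__ == '__main__':
        filenames = ['a.h5', 'b.h5', 'c.h5']  # placeholder file names
        with Pool(processes=len(filenames)) as pool:
            arrays = pool.map(load_whole_file, filenames)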
So my final question is: how can I read the contents of the .h5 files into main memory faster? Again: »wasting« RAM in favor of saving time is permitted. The contents have to reside in main memory, so circumventing the problem by just reading lines, or fractions, is not an option.
I know that I could just store the .h5 files uncompressed, but that is the last option I want to resort to, since space on the SSD is scarce. I would prefer having both: compressed files and fast reads (ideally by better utilizing the available resources).
Meta information: I use Python 3.5.2 and h5py 2.8.0.
EDIT: While reading the file, the SSD works at a speed of 72 MB/s, far from its maximum. The .h5 files were created using h5py's create_dataset method with the compression="lzf" option.
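For context, a minimal sketch of how such a file might have been written (the dataset name, shape, and dtype here are my own assumptions, not taken from the question):

    import h5py
    import numpy as np

    # Placeholder data; the real datasets are 20-25 GB uncompressed.
    data = np.random.rand(1000000, 100).astype(np.float32)

    with h5py.File('foo.h5', 'w') as h_file:
        h_file.create_dataset('label', data=data, compression='lzf')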
EDIT 2: This is (simplified) the code I use to read the content of a (compressed) HDF5 file:
    from itertools import repeat
    from multiprocessing import Pool

    import h5py
    import numpy as np


    def opener(filename, label):  # regular version
        with h5py.File(filename, 'r') as h_file:
            data = h_file[label][:]
        return data


    def fast_opener(filename, label):  # multiple processes version
        with h5py.File(filename, 'r') as h_file:
            length = len(h_file[label])
            pool = Pool()  # multiprocessing.Pool and not multiprocessing.dummy.Pool
            args_iter = zip(
                range(0, length, 1000),
                repeat(filename),
                repeat(label),
            )
            chunks = pool.starmap(_read_chunk_at, args_iter)
            pool.close()
            pool.join()
        return np.concatenate(chunks)


    def _read_chunk_at(index, filename, label):
        # Each worker opens the file itself and reads one slice of 1000 rows.
        with h5py.File(filename, 'r') as h_file:
            data = h_file[label][index : index + 1000]
        return data
As you can see, the decompression is done by h5py transparently.
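For what it's worth, one variation that is sometimes suggested for the multiprocessing case (this is my own sketch, not code from the question, and I cannot confirm it resolves the Fletcher32 error) is to query the dataset length and close the parent's file handle before the worker pool is forked, so that no open HDF5 handle is shared across processes. It reuses the imports and _read_chunk_at from above:

    def fast_opener_closed_handle(filename, label):  # hypothetical variant
        # Read only the length, then leave the with-block so the parent's
        # HDF5 handle is closed before the worker processes are forked.
        with h5py.File(filename, 'r') as h_file:
            length = len(h_file[label])

        with Pool() as pool:
            args_iter = zip(
                range(0, length, 1000),
                repeat(filename),
                repeat(label),
            )
            chunks = pool.starmap(_read_chunk_at, args_iter)
        return np.concatenate(chunks)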
Answer 1:
h5py handles decompression of LZF files via a filter. The source code of the filter, implemented in C, is available on the h5py GitHub here. Looking at the implementation of lzf_decompress, which is the function causing your bottleneck, you can see it's not parallelized (no idea if it's even parallelizable; I'll leave that judgement to people more familiar with LZF's inner workings).
With that said, I'm afraid there's no way to just take your huge compressed file and multithread-decompress it. Your options, as far as I can tell, are:
- Split the huge file into smaller, individually-compressed chunks, parallel-decompress each chunk on a separate core (multiprocessing might help there, but you'll need to take care about inter-process shared memory) and join everything back together after it's decompressed. (A sketch of this option follows below.)
- Just use uncompressed files.
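A minimal sketch of the first option, under the assumption that the data has been re-written into several smaller, individually LZF-compressed files (the part file names and the dataset label are placeholders), each small enough that the per-worker return value stays below the 4 GiB pickle limit:

    from multiprocessing import Pool

    import h5py
    import numpy as np

    PART_FILES = ['part_0.h5', 'part_1.h5', 'part_2.h5', 'part_3.h5']  # placeholders


    def load_part(filename, label='label'):
        # Each worker decompresses one small file independently, on its own core.
        with h5py.File(filename, 'r') as h_file:
            return h_file[label][:]


    if __name__ == '__main__':
        with Pool(processes=len(PART_FILES)) as pool:
            parts = pool.map(load_part, PART_FILES)
        data = np.concatenate(parts)  # join everything back together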
Source: https://stackoverflow.com/questions/55296989/how-to-speed-up-reading-from-compressed-hdf5-files