Question
There is a dramatic slowdown when inserting many datasets into a group.
I have found that the point at which the slowdown begins depends on the length of the dataset names and on the number of datasets already in the group. A larger dataset takes slightly longer to insert, but it does not change when the slowdown occurs.
The following example uses an exaggerated name length so that the problem shows up without a long wait.
- Python 3
- HDF5 Version 1.8.15 (1.10.1 gets even slower)
- h5py version: 2.6.0
Example:
import numpy as np
import h5py
import time

# In-memory file (core driver, no backing store), so disk I/O is not a factor.
# Mode 'w' is given explicitly because newer h5py versions default to read-only.
hdf = h5py.File('dummy.h5', 'w', driver='core', backing_store=False)
group = hdf.create_group('some_group')

dtype = np.dtype([
    ('name', 'a20'),
    ('x', 'f8'),
    ('y', 'f8'),
    ('count', 'u8'),
])
ds = np.array([('something', 123.4, 567.8, 20)], dtype=dtype)

# Deliberately long name (1300 characters) to reach the slowdown quickly.
long_name = 'abcdefghijklmnopqrstuvwxyz'*50

t = time.time()
size = 1000*25
for i in range(1, size + 1):
    group.create_dataset(
        long_name+str(i),
        (len(ds),),
        maxshape=(None,),
        chunks=True,
        compression='gzip',
        compression_opts=9,
        shuffle=True,
        fletcher32=True,
        dtype=dtype,
        data=ds
    )
    # Report the insert rate for every batch of 1000 datasets.
    if i % 1000 == 0:
        dt = time.time() - t
        t = time.time()
        print('{0} / {1} - Rate: {2:.1f} inserts per second'.format(i, size, 1000/dt))

hdf.close()
Output:
1000 / 25000 - Rate: 1590.9 inserts per second
2000 / 25000 - Rate: 1770.0 inserts per second
...
17000 / 25000 - Rate: 1724.7 inserts per second
18000 / 25000 - Rate: 106.3 inserts per second
19000 / 25000 - Rate: 66.9 inserts per second
20000 / 25000 - Rate: 66.9 inserts per second
21000 / 25000 - Rate: 67.5 inserts per second
22000 / 25000 - Rate: 68.4 inserts per second
23000 / 25000 - Rate: 47.7 inserts per second
24000 / 25000 - Rate: 42.0 inserts per second
25000 / 25000 - Rate: 39.8 inserts per second
Again, I exaggerated the length of the name only to reproduce the issue quickly. In my real use case the names are about 25 characters long and the slowdown starts once ~700k datasets are in a group. After ~1.4M datasets it gets even slower.
Why is this happening?
Is there a solution/remedy?
Answer 1:
Try passing libver='latest' when you open the file. Recent versions of the HDF5 library are much faster at adding items to a group, but for compatibility reasons this improvement is only enabled with that option.
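A minimal sketch of that suggestion, reusing the in-memory setup from the question. The helper name create_many, its parameters, and the dataset count are hypothetical and chosen only for illustration; libver is the actual keyword argument of h5py.File. No timings are shown here because they depend on the HDF5 and h5py versions installed.

```python
import time

import h5py
import numpy as np


def create_many(libver=None, n=2000, name_prefix='abcdefghijklmnopqrstuvwxyz' * 50):
    """Create n small datasets in one group; return (dataset count, elapsed seconds).

    Hypothetical helper for illustration, not part of h5py.
    """
    data = np.arange(4, dtype='f8')
    # Only pass libver when one was requested, so None means "library default".
    kwargs = {} if libver is None else {'libver': libver}
    # In-memory file: the core driver with backing_store=False writes nothing to disk.
    with h5py.File('dummy.h5', 'w', driver='core', backing_store=False, **kwargs) as hdf:
        group = hdf.create_group('some_group')
        start = time.time()
        for i in range(n):
            group.create_dataset(name_prefix + str(i), data=data)
        elapsed = time.time() - start
        return len(group), elapsed


if __name__ == '__main__':
    # Compare the default file format with the newest one the library supports.
    for libver in (None, 'latest'):
        count, elapsed = create_many(libver)
        print('libver={0!r}: {1} datasets in {2:.2f} s'.format(libver, count, elapsed))
```

Note that a file written with libver='latest' may not be readable by older HDF5 versions, which is the compatibility trade-off the answer refers to.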
Source: https://stackoverflow.com/questions/45023488/inserting-many-hdf5-datasets-very-slow