问题
We are working with large (1.2TB) uncompressed, unchunked hdf5 files with h5py in python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15MB individually in a randomized order. We are working on a linux (Ubuntu 18.04) machine with 192 GB RAM. We noticed that the program is slowly filling the cache. When total size of cache reaches size comparable with full machine RAM (free memory in top almost 0 but plenty ‘available’ memory) swapping occurs slowing down all other applications. In order to pinpoint the source of the problem, we wrote a separate minimal example to isolate our dataloading procedures - but found that the problem was independent of each part of our method.
We tried: Building numpy memmap and accessing requested slice:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
offset=hdf5_event_data.id.get_offset(),dtype=hdf5_event_data.dtype)
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
Reopening the memmap on each call to getitem:
#on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index,:,:,:19]
return self.e
Addressing the h5 file directly and converting to a numpy array:
#on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512,40,40,19))
#on __getitem__:
self.e = self.event_data[index,:,:,:19]
return self.e
We also tried the above approaches within pytorch Dataset/Dataloader framework - but it made no difference.
We observe high memory fragmentation as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn’t help while application is running. Cleaning cache before application starts removes swapping behaviour until cache eats up the memory again - and swapping starts again.
Our working hypothesis is that the system is trying to hold on to cached file data which leads to memory fragmentation. Eventually when new memory is requested swapping is performed even though most memory is still ‘available’.
As such, we turned to ways to change the Linux environment’s behaviour around file caching and found this post . Is there a way to call the POSIX_FADV_DONTNEED flag when opening an h5 file in python or a portion of that we accessed via numpy memmap, so that this accumulation of cache does not occur? In our use case we will not be re-visiting that particular file location for a long time (till we access all other remaining ‘slices’ of the file)
回答1:
You can use os.posix_fadvise to tell the OS how regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor, and get an idea of the regions you plan on reading.
The easiest way to get the file descriptor is to supply it yourself:
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
You can now set the advice. For the entire file:
os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)
Or for a particular dataset:
os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
Other things to look at
H5py does its own chunk caching. You may want to try turning this off:
f = h5py.File(..., rdcc_nbytes=0)
As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2'
:
f = h5py.File(..., driver='sec2')
来源:https://stackoverflow.com/questions/62415606/is-there-a-way-to-open-hdf5-files-with-the-posix-fadv-dontneed-flag