Question
I'm writing thousands of .csv files containing time and amplitude columns to a single .hdf5 file. As an example, I used a small set of .csv files corresponding to a total of ~11 MB.
After writing all the .csv contents to hdf5, the resulting file has a size of ~36 MB without compression="gzip". Using compression="gzip", the file size is around 38 MB.
I understand that hdf5 compresses only the datasets themselves, that is, the numpy arrays in my case (~500 rows of floats each).
As a comparison, I was previously saving all the .csv data in a single .json file, compressing it and then reading it back. I chose hdf5 because of memory issues: the json file is loaded entirely into memory, with a footprint 2x to Xx times larger than the file size.
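To see where the extra megabytes go, the sketch below (the file name here is an assumption) compares each dataset's raw data size with the space it actually occupies on disk:

import h5py

def report(name, obj):
    # only datasets carry array data; groups contribute metadata only
    if isinstance(obj, h5py.Dataset):
        raw = obj.size * obj.dtype.itemsize    # bytes of the array itself
        stored = obj.id.get_storage_size()     # bytes this dataset uses on disk
        print(f"{name}: raw={raw} B, stored={stored} B")

with h5py.File('waveforms.hdf5', 'r') as f:
    f.visititems(report)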
This is how I add a dataset to the .hdf5 file:
import h5py

def hdf5_dump_dataset(hdf5_filename, hdf5_data, dsetname):
    with h5py.File(hdf5_filename, 'a') as f:
        dset = f.create_dataset(dsetname, data=hdf5_data, compression="gzip", chunks=True, maxshape=(None,))
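For reference, a tiny usage sketch of this helper (the file name and dataset path below are made up for illustration):

import numpy as np

time = np.array([1.000e-08, 1.001e-08, 1.003e-08])
hdf5_dump_dataset('waveforms.hdf5', time, 'folder_1/file_1/time')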
This is how I read a dataset from the .hdf5 file:
def hdf5_load_dataset(hdf5_filename, dsetname):
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]
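And a matching read-back sketch (same assumed names as above):

time = hdf5_load_dataset('waveforms.hdf5', 'folder_1/file_1/time')
print(time.shape, time.dtype)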
The folder structure with the .csv files:
root/
    folder_1/
        file_1.csv
        file_X.csv
    folder_X/
        file_1.csv
        file_X.csv
Inside each .csv file:
time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05
Script to save the .csv contents in the hdf5 file:
# csv_dict is a dict() with all folders and csv files as keys
# ex. csv_dict['folder_1']['file_1'] (without the .csv extension)
for folder in csv_dict:
    for file in csv_dict[folder]:
        path_waveform = f"{folder}/{file}.csv"
        time, amplitude = self.read_csv_return_list_of_time_amplitude(path_waveform)
        hdf5_dump_dataset(path_hdf5_waveforms, amplitude, '/'.join([folder, file, 'amplitude']))
        hdf5_dump_dataset(path_hdf5_waveforms, time, '/'.join([folder, file, 'time']))
For each .csv file in each folder I have one dataset for the time and one for the amplitude. The structure of the hdf5 file looks like this:
folder_1/file_1/time
folder_1/file_1/amplitude
where
time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...]) # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...]) # 500 items
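A quick way to confirm this layout (assuming the same path_hdf5_waveforms as in the script above) is to print every object name in the file:

with h5py.File(path_hdf5_waveforms, 'r') as f:
    f.visit(print)  # prints every group and dataset name, e.g. folder_1/file_1/time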
My question is: Is there a way to compress the whole hdf5 file?
Source: https://stackoverflow.com/questions/59987928/compressing-hdf5-files-with-h5py