Question
I'm writing thousands of .csv files containing time and amplitude columns to a single .hdf5 file. As an example, I used a small set of .csv files corresponding to a total of ~11 MB.
After writing all the .csv contents to hdf5, the resulting file has a size of ~36 MB without compression="gzip". Using compression="gzip", the file size is around 38 MB.
I understand that hdf5 compresses only the datasets themselves, that is, the numpy arrays in my case (~500 rows of floats each).
As a comparison, I was previously saving all the .csv data in a single .json file, compressing it and then reading it back. I chose hdf5 because of memory issues: the json file is loaded entirely into memory, with a footprint 2x to Xx times larger than the file size.
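To see where the extra megabytes go, the sketch below (the file name here is an assumption) compares each dataset's raw data size with the space it actually occupies on disk:

import h5py

def report(name, obj):
    # only datasets carry array data; groups contribute metadata only
    if isinstance(obj, h5py.Dataset):
        raw = obj.size * obj.dtype.itemsize    # bytes of the array itself
        stored = obj.id.get_storage_size()     # bytes this dataset uses on disk
        print(f"{name}: raw={raw} B, stored={stored} B")

with h5py.File('waveforms.hdf5', 'r') as f:
    f.visititems(report)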
This is how I add a dataset to the .hdf5 file:
import h5py

def hdf5_dump_dataset(hdf5_filename, hdf5_data, dsetname):
    with h5py.File(hdf5_filename, 'a') as f:
        dset = f.create_dataset(dsetname, data=hdf5_data, compression="gzip", chunks=True, maxshape=(None,))
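For reference, a tiny usage sketch of this helper (the file name and dataset path below are made up for illustration):

import numpy as np

time = np.array([1.000e-08, 1.001e-08, 1.003e-08])
hdf5_dump_dataset('waveforms.hdf5', time, 'folder_1/file_1/time')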
This is how I read a dataset from the .hdf5 file:
def hdf5_load_dataset(hdf5_filename, dsetname):
    with h5py.File(hdf5_filename, 'r') as f:
        dset = f[dsetname]
        return dset[...]
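And a matching read-back sketch (same assumed names as above):

time = hdf5_load_dataset('waveforms.hdf5', 'folder_1/file_1/time')
print(time.shape, time.dtype)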
The folder structure with the .csv files:
root/
    folder_1/
        file_1.csv
        file_X.csv
    folder_X/
        file_1.csv
        file_X.csv
Inside each .csv file:
time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05
Script to save the .csv contents in the hdf5 file:
# csv_dict is a dict() with all folders and csv files as keys
# ex. csv_dict['folder_1']['file_1'] (without the .csv extension)
for folder in csv_dict:
    for file in csv_dict[folder]:
        path_waveform = f"{folder}/{file}.csv"
        time, amplitude = self.read_csv_return_list_of_time_amplitude(path_waveform)
        hdf5_dump_dataset(path_hdf5_waveforms, amplitude, '/'.join([folder, file, 'amplitude']))
        hdf5_dump_dataset(path_hdf5_waveforms, time, '/'.join([folder, file, 'time']))
For each .csv file in each folder I have one dataset for the time and one for the amplitude. The structure of the hdf5 file looks like this:
folder_1/file_1/time
folder_1/file_1/amplitude
where
time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...]) # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...]) # 500 items
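A quick way to confirm this layout (assuming the same path_hdf5_waveforms as in the script above) is to print every object name in the file:

with h5py.File(path_hdf5_waveforms, 'r') as f:
    f.visit(print)  # prints every group and dataset name, e.g. folder_1/file_1/time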
My question is: Is there a way to compress the whole hdf5 file?
Source: https://stackoverflow.com/questions/59987928/compressing-hdf5-files-with-h5py