Question
I'm trying to write data to an h5py dataset. I'm using a high-memory 12-core GCE instance writing to an SSD disk, but the job has been running for 13 hours with no end in sight. I'm running a Jupyter Notebook on the GCE instance to unpickle a large number of small files (stored on a second, non-SSD disk) before adding them to an h5py dataset in a file stored on the SSD disk. The dataset parameters are:
- Max shape = (29914, 251328)
- Chunks = (59, 982)
- Compression = gzip
- dtype = float64
My code is listed below
import os
import pickle
import random

import h5py

# Get a sample
minsample = 13300
sampleWithOutReplacement = random.sample(ListOfPickles, minsample)

print(h5pyfile)
with h5py.File(h5pyfile, 'r+') as hf:
    GroupToStore = hf.get('group')
    DatasetToStore = GroupToStore.get('ds1')
    # Unpickle the contents and add them to the h5py file
    for idx, files in enumerate(sampleWithOutReplacement):
        # Sample the minimum number of examples
        %time FilePath = os.path.join(SourceOfPickles, files)
        # Use this method to auto close the file
        with open(FilePath, "rb") as f:
            %time DatasetToStore[idx:] = pickle.load(f)
            # print("Processing file ", idx)

print("File Closed")
The h5py file on disk seems to grow by about 1.4 GB for each dataset I populate using the code above. Below is the code I use to create the dataset in the h5py file:
group.create_dataset(labels, dtype='float64', shape=(maxSize, 251328), maxshape=(maxSize, 251328), compression="gzip")
What improvements can I make to either my configuration or my code or both to reduce the time needed to populate the h5py file?
Update 1: I added some %time magic to the Jupyter notebook to time the process. I'd welcome any advice on speeding up the loading into the datastore, which was reported as taking over 8 hours:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.1 µs
CPU times: user 8h 4min 11s, sys: 1min 18s, total: 8h 5min 30s
Wall time: 8h 5min 29s
Answer 1:
This seems very wrong: DatasetToStore[idx:]
You probably want: DatasetToStore[idx, ...]
I think your version overwrites every row after idx with the unpickled data on every iteration, while this version writes only a single row of the dataset on each iteration.
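To illustrate the difference with plain NumPy (h5py datasets follow the same slicing rules; the array shape here is made up purely for illustration):

import numpy as np

ds = np.zeros((4, 3))
row = np.arange(3, dtype=float)

ds[1:] = row           # broadcasts `row` into rows 1, 2 and 3
ds[2, ...] = row + 10  # writes only row 2

print(ds)
# [[ 0.  0.  0.]
#  [ 0.  1.  2.]
#  [10. 11. 12.]
#  [ 0.  1.  2.]]

So with DatasetToStore[idx:], every iteration rewrites (and, with gzip, recompresses) all rows from idx to the end of the 29914-row dataset, which goes a long way toward explaining the runtime.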
Answer 2:
JRoose is right; something in the code seems to be wrong.
By default h5py uses a chunk cache of only 1 MB, which isn't enough for your problem. You could change the cache settings through the low-level API or use h5py_cache instead: https://pypi.python.org/pypi/h5py-cache/1.0
Change the line
with h5py.File(h5pyfile, 'r+') as hf
to
with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=500*1024**2) as hf
to increase the chunk cache to, for example, 500 MB.
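(As an aside, on newer h5py releases, 2.9 or later, the chunk cache can also be enlarged without an extra package through the rdcc_nbytes keyword of h5py.File; a minimal sketch, reusing the 500 MB figure from above and the h5pyfile name from the question:)

import h5py

# Open the file with a ~500 MB chunk cache (h5py >= 2.9)
with h5py.File(h5pyfile, 'r+', rdcc_nbytes=500 * 1024**2) as hf:
    DatasetToStore = hf['group/ds1']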
I assume
pickle.load(f)
results in a 1D array, while your dataset is 2D. In this case there is nothing wrong with writing
%time DatasetToStore[idx,:] = pickle.load(f)
but in my experience it is rather slow. To increase the speed, make the data a 2D array before passing it to the dataset:
%time DatasetToStore[idx:idx+1, :] = np.expand_dims(pickle.load(f), axis=0)
I don't really know why this is faster, but in my scripts this version is about 20 times faster than the version above. The same goes for reading from an HDF5 file.
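Putting both answers together, the write loop from the question might look roughly like this; a sketch only, reusing the variable names from the question and the 500 MB cache figure from above (rdcc_nbytes assumes a newer h5py; on older versions h5py_cache.File with chunk_cache_mem_size does the same job):

import os
import pickle

import numpy as np
import h5py

with h5py.File(h5pyfile, 'r+', rdcc_nbytes=500 * 1024**2) as hf:
    DatasetToStore = hf['group/ds1']
    for idx, files in enumerate(sampleWithOutReplacement):
        FilePath = os.path.join(SourceOfPickles, files)
        with open(FilePath, "rb") as f:
            # Write exactly one row, passed as a (1, 251328) array
            DatasetToStore[idx:idx+1, :] = np.expand_dims(pickle.load(f), axis=0)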
Source: https://stackoverflow.com/questions/39087689/writing-data-to-h5py-on-ssd-disk-appears-slow-what-can-i-do-to-speed-it-up