Question
The code below is how I save the NumPy array, and the saved file is about 27 GB. There are more than 200K images, each with shape (224, 224, 3).
import h5py

# write each image as its own dataset, keyed by its index
hf = h5py.File('cropped data/features_train.h5', 'w')
for i, each in enumerate(features_train):
    hf.create_dataset(str(i), data=each)
hf.close()
This is the method I use to load the data, and it takes hours to finish.
import h5py
import numpy as np

features_train = np.zeros(shape=(1, 224, 224, 3))
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = hf.get(key)
    x = np.array(x)
    # grow the array by one image per iteration
    features_train = np.append(features_train, np.array([x]), axis=0)
hf.close()
So, does anyone have a better solution for handling this large amount of data?
Answer 1:
You didn't tell us how much physical RAM your server has, but 27 GiB sounds like "a lot". Consider breaking your run into several smaller batches.
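If memory is the constraint, here is a minimal sketch of batched loading, assuming downstream processing can consume one chunk at a time; batch_size is a hypothetical knob to tune against available RAM:

import h5py
import numpy as np

batch_size = 1000  # hypothetical chunk size; tune to fit available RAM

hf = h5py.File('cropped data/features_train.h5', 'r')
keys = sorted(hf.keys(), key=int)        # keys were written as '0', '1', ... above
for start in range(0, len(keys), batch_size):
    batch = np.stack([hf[k][()] for k in keys[start:start + batch_size]])
    # ...process `batch` here, then let it go out of scope before the next chunk...
hf.close()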
There is an old saw in Java land that asks "why does this have quadratic runtime?", that is, "why is this so slow?"
String s = "";
for (int i = 0; i < 1e6; i++) {
    s += "x";
}
The answer is that toward the end, each iteration reads roughly a million characters, writes them back, and then appends a single character, for a total cost of O(1e12). The standard solution is to use a StringBuilder, which brings us back to the expected O(1e6).
Here, I worry that calling np.append() pushes us into the quadratic regime. To verify, replace the features_train assignment with a simple evaluation of np.array([x]), so we spend a moment computing and then immediately discarding that value on each iteration. If the conjecture is right, runtime will be much smaller.
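A quick way to test that conjecture, assuming the same file and key layout as in the question (the timing code is just illustrative):

import time
import h5py
import numpy as np

start = time.time()
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = np.array(hf.get(key))
    _ = np.array([x])  # compute and immediately discard; no np.append()
hf.close()
print(f'loop without np.append() took {time.time() - start:.1f} s')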
To remedy it, avoid calling np.append(). Rather, preallocate the 27 GiB with np.zeros() (or np.empty()) and then, within the loop, assign each freshly read array into its preallocated slot at the corresponding offset.
Linear runtime will allow the task to complete much more quickly.
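A minimal sketch of that approach, assuming the keys were written as '0', '1', ... as in the question's saving loop:

import h5py
import numpy as np

hf = h5py.File('cropped data/features_train.h5', 'r')
n = len(hf)  # number of stored images
# preallocate the full array once, matching the stored dtype
features_train = np.empty((n, 224, 224, 3), dtype=hf['0'].dtype)
for i in range(n):
    # read each dataset directly into its preallocated slot
    features_train[i] = hf[str(i)][()]
hf.close()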
Source: https://stackoverflow.com/questions/58685930/saving-and-loading-large-numpy-matrix