Saving and loading a large NumPy matrix

生来就可爱ヽ(ⅴ<●) submitted on 2020-03-05 07:10:10

Question


The code below is how I save the NumPy array; the saved file is about 27 GB. There are more than 200K images, and each has shape (224, 224, 3).

import h5py

# Write each image as its own dataset, keyed by its index.
hf = h5py.File('cropped data/features_train.h5', 'w')
for i, each in enumerate(features_train):
    hf.create_dataset(str(i), data=each)
hf.close()

This is how I load the data; loading takes hours.

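import h5py
import numpy as np

# Start from a single all-zero row, then grow the array by one image per dataset.
features_train = np.zeros(shape=(1, 224, 224, 3))
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = hf.get(key)
    x = np.array(x)
    features_train = np.append(features_train, np.array([x]), axis=0)
hf.close()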

Does anyone have a better solution for data this large?


Answer 1:


You didn't tell us how much physical RAM your server has, but 27 GiB sounds like "a lot". Consider breaking your run into several smaller batches.
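As a rough illustration of the batching idea, here is a minimal sketch that yields fixed-size batches instead of building one giant array. The function name iter_batches, the batch size, and the dtype are assumptions made for illustration; the only thing taken from the question is that image i is stored under the key str(i). A plain dict stands in for the open HDF5 file so the sketch runs without the real data.

```python
import numpy as np

def iter_batches(store, n_items, item_shape, batch_size, dtype=np.float64):
    """Yield arrays of up to batch_size items read from store.

    store is any mapping from str(i) to an array-like that supports [...]
    (an open h5py.File laid out as in the question should qualify).
    """
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        # Allocate only one batch at a time, never the full data set.
        batch = np.empty((stop - start, *item_shape), dtype=dtype)
        for j, i in enumerate(range(start, stop)):
            batch[j] = store[str(i)][...]
        yield batch

# Hypothetical stand-in for the HDF5 file: 12 tiny "images".
fake_store = {str(i): np.full((4, 4, 3), i, dtype=np.uint8) for i in range(12)}
batch_sizes = [len(b) for b in iter_batches(fake_store, 12, (4, 4, 3), 5, np.uint8)]
print(batch_sizes)
```

Each batch can be processed and released before the next one is read, so peak memory is bounded by the batch size rather than by the file size.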

There is an old saw in Java land that asks "why does this have quadratic runtime?", that is, "why is this so slow?"

String s = "";
for (int i = 0; i < 1e6; i++) {
    s += "x";
}

The answer is that toward the end, each iteration reads roughly a million characters, writes them all back out, and then appends a single character. The total cost is on the order of 1e12 character copies, which is quadratic in the number of iterations. The standard solution is to use a StringBuilder, which brings us back to the expected ~1e6 operations.

Here, I worry that calling np.append() pushes us into the quadratic regime, because each call copies the entire accumulated array into a new one.

To verify, replace the features_train assignment with a simple evaluation of np.array([x]), so we spend a moment computing and then immediately discarding that value on each iteration. If the conjecture is right, runtime will be much smaller.
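The experiment can be tried on small synthetic data first. The sketch below is only an illustration: the helper names, the item count, and the reduced image shape are assumptions chosen so that it finishes quickly, and a list of in-memory arrays stands in for the HDF5 file.

```python
import time
import numpy as np

def load_with_append(items, item_shape):
    """Mimic the question's loop: grow the array with np.append each time."""
    out = np.zeros(shape=(0, *item_shape))
    for x in items:
        out = np.append(out, np.array([x]), axis=0)
    return out

def load_and_discard(items):
    """Same per-item work, but the result is thrown away immediately."""
    for x in items:
        np.array([x])

def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

item_shape = (32, 32, 3)  # much smaller than (224, 224, 3), for speed
items = [np.ones(item_shape) for _ in range(500)]

t_append = timed(load_with_append, items, item_shape)
t_discard = timed(load_and_discard, items)
print(f"append: {t_append:.4f} s, discard: {t_discard:.4f} s")
```

If the conjecture holds, doubling the number of items should roughly quadruple the append timing while only doubling the discard timing.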

To remedy it, avoid calling np.append(). Instead, preallocate the full 27 GiB with np.zeros() (or np.empty()), and within the loop assign each freshly read array into its preallocated slot. Linear runtime will allow the task to complete much more quickly.
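A minimal sketch of the preallocation approach follows. The function name load_preallocated and the dtype argument are assumptions for illustration (the question does not say which dtype the images use), and a plain dict again stands in for the open HDF5 file. Indexing by str(i) relies on the key layout from the question's saving code.

```python
import numpy as np

def load_preallocated(store, n_items, item_shape, dtype=np.float64):
    """Read n_items arrays from store into one preallocated array.

    store is any mapping from str(i) to an array-like that supports [...]
    (an open h5py.File laid out as in the question should qualify).
    """
    # One allocation up front; no per-iteration copying of earlier items.
    out = np.empty((n_items, *item_shape), dtype=dtype)
    for i in range(n_items):
        out[i] = store[str(i)][...]
    return out

# Hypothetical stand-in for the HDF5 file: 12 tiny "images".
fake_store = {str(i): np.full((4, 4, 3), i, dtype=np.uint8) for i in range(12)}
features = load_preallocated(fake_store, len(fake_store), (4, 4, 3), dtype=np.uint8)
print(features.shape, features.dtype)
```

With the real file, the same function could presumably be called as load_preallocated(hf, len(hf), (224, 224, 3), dtype=...) inside a with h5py.File(path, 'r') as hf: block. Passing the dtype the images were stored in matters here: the question's loop starts from a float64 array, so its result is float64 even if the stored images are a smaller type.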



Source: https://stackoverflow.com/questions/58685930/saving-and-loading-large-numpy-matrix
