I have about 0.8 million 256x256 RGB images, which amount to over 7 GB.
I want to use them as training data for a convolutional neural network, and I want to put them, along with their labels, into a cPickle file.
Now, building these arrays takes so much memory that the process starts swapping to my hard drive and nearly fills it. In hindsight that's not surprising: the JPEGs are compressed on disk, but once decoded each image is 256 * 256 * 3 = 196,608 bytes, so 0.8 million of them need roughly 157 GB of RAM as raw uint8 pixels.
Is this a bad idea?
What would be a smarter or more practical way to load the data into the CNN, or to pickle it, without running into these memory problems?
This is what the code looks like:
import numpy as np
import cPickle
import os
from PIL import Image

pixels = []
labels = []
data = []

for subdir, dirs, files in os.walk('images'):
    for file in files:
        if file.endswith(".jpg"):
            floc = os.path.join(subdir, file)
            im = Image.open(floc)
            # flatten the image into one row of pixel values;
            # uint8 keeps it at 1 byte per channel instead of 8
            pix = np.array(im.getdata(), dtype=np.uint8)
            pixels.append(pix)
            labels.append(1)

pixels = np.array(pixels)
labels = np.array(labels)
traindata = [pixels, labels]

# ... do the same for validation and test data
# ... put all data and labels into the 'data' list

with open('data.pkl', 'wb') as f:
    cPickle.dump(data, f)
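One alternative I'm considering (a minimal sketch, assuming every image really is 256x256x3; the file names 'pixels.dat' and 'labels.npy' and the constant label 1 are placeholders mirroring the code above) is to stream the decoded pixels into a memory-mapped NumPy array instead of building a giant list in RAM:

import os
import numpy as np
from PIL import Image

# Collect the paths first so the memmap can be sized up front.
paths = [os.path.join(subdir, f)
         for subdir, dirs, files in os.walk('images')
         for f in files if f.endswith('.jpg')]

n = len(paths)
# One uint8 row per image, backed by a file on disk, so only the
# image currently being decoded has to live in memory.
pixels = np.memmap('pixels.dat', dtype=np.uint8, mode='w+',
                   shape=(n, 256 * 256 * 3))
labels = np.zeros(n, dtype=np.int32)

for i, floc in enumerate(paths):
    im = Image.open(floc)
    # write the flattened image straight into its row on disk
    pixels[i] = np.asarray(im, dtype=np.uint8).reshape(-1)
    labels[i] = 1  # placeholder label, as in the code above

pixels.flush()  # push any buffered rows out to disk
np.save('labels.npy', labels)

At training time the same file could be reopened with np.memmap(..., mode='r') and sliced batch by batch, so the CNN never needs the whole dataset in memory at once. Would that be reasonable, or is something like HDF5 the more standard choice here?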