Question
I have two 60 x 80921 matrices, one filled with data, one with reference.
I would like to store the values as key/value pairs in two different LMDBs, one for training (say I'll slice around the 60000 column mark) and one for testing. Here is my idea; does it work?
X_train = X[:,:60000]
Y_train = Y[:,:60000]
X_test = X[:,60000:]
Y_test = Y[:,60000:]
X_train = X_train.astype(int)
X_test = X_test.astype(int)
Y_train = Y_train.astype(int)
Y_test = Y_test.astype(int)
map_size = X_train.nbytes * 10
env = lmdb.open('sensormatrix_train_lmdb', map_size=map_size)
with env.begin(write=True) as txn:
    for i in range(60):
        for j in range(60000):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.height = X_train.shape[0]
            datum.width = X_train.shape[1]
            datum.data = X_train[i,j].tobytes()
            datum.label = int(Y[i,j])
            str_id = '{:08}'.format(i)
I'm really not sure about this code. Also, what does format(i) on the last line refer to?
Answer 1:
It's not 100% clear what you are trying to do: are you treating each entry as a separate data sample, or are you trying to train on 60K 1D vectors of dim=60?
Assuming you have 60K training samples of dim 60, you can write the training lmdbs like this:
import lmdb
import caffe

# map_size is the maximum database size in bytes; you can make it a little bigger
map_size = X_train.nbytes * 10
env_x = lmdb.open('sensormatrix_train_x_lmdb', map_size=map_size)
env_y = lmdb.open('sensormatrix_train_y_lmdb', map_size=map_size)
with env_x.begin(write=True) as txn_x, env_y.begin(write=True) as txn_y:
    for i in range(X_train.shape[1]):  # one sample per column
        x = X_train[:, i]
        y = Y_train[:, i]
        # the sample index is stored as the label so x and y can be matched later
        datum_x = caffe.io.array_to_datum(arr=x.reshape((60, 1, 1)), label=i)
        datum_y = caffe.io.array_to_datum(arr=y.reshape((60, 1, 1)), label=i)
        # format a zero-padded lmdb key for this entry (bytes, as Python 3 requires)
        keystr = '{:0>10d}'.format(i).encode('ascii')
        txn_x.put(keystr, datum_x.SerializeToString())  # actual write to lmdb
        txn_y.put(keystr, datum_y.SerializeToString())
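As for what the format(i) call does: it turns the integer index into a fixed-width, zero-padded string that serves as the LMDB key. The sketch below illustrates why the padding matters, assuming LMDB's default behaviour of ordering keys byte by byte. The helper name make_key and the sample indices are made up for illustration and are not part of lmdb or caffe.

```python
def make_key(i, width=10):
    """Hypothetical helper: zero-pad index i to `width` digits, as bytes."""
    return '{:0>{w}d}'.format(i, w=width).encode('ascii')

indices = [0, 2, 10, 100]
padded = [make_key(i) for i in indices]
unpadded = [str(i).encode('ascii') for i in indices]

# Sorting bytes mimics a byte-wise (lexicographic) key ordering.
print(sorted(padded))    # numeric order is preserved
print(sorted(unpadded))  # b'10' and b'100' sort before b'2'
```

Note that the str_id in the question is built from i alone, so it yields only 60 distinct keys for all entries; a key has to be unique per stored entry, otherwise later writes replace earlier ones.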
Now you have two lmdbs for training. In your 'prototxt' you should have two corresponding "Data" layers:
layer {
  name: "input_x"
  top: "x"
  top: "idx_x"
  type: "Data"
  data_param { source: "sensormatrix_train_x_lmdb" batch_size: 32 }
  include { phase: TRAIN }
}
layer {
  name: "input_y"
  top: "y"
  top: "idx_y"
  type: "Data"
  data_param { source: "sensormatrix_train_y_lmdb" batch_size: 32 }
  include { phase: TRAIN }
}
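To make sure you read corresponding x, y pairs, you can add a sanity check. Each datum's label holds its sample index, so the two index blobs should be identical and this loss should always be 0: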
layer {
  name: "sanity"
  type: "EuclideanLoss"
  bottom: "idx_x"
  bottom: "idx_y"
  top: "sanity"
  loss_weight: 0
  propagate_down: false
  propagate_down: false
}
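To see what the sanity output would look like, here is a rough pure-Python sketch of the quantity a Euclidean loss computes, taken here as the sum of squared differences divided by 2N, which is how Caffe documents its EuclideanLoss layer. The function name and the sample index lists are illustrative only; nothing here calls Caffe itself.

```python
def euclidean_loss(a, b):
    """Sum of squared differences divided by 2 * N (N = batch size)."""
    assert len(a) == len(b)
    n = len(a)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / (2.0 * n)

idx_x = [0, 1, 2, 3]          # indices read from the x lmdb
idx_y_aligned = [0, 1, 2, 3]  # y lmdb read in step with x
idx_y_shifted = [1, 2, 3, 4]  # y lmdb read one entry out of step

aligned = euclidean_loss(idx_x, idx_y_aligned)
shifted = euclidean_loss(idx_x, idx_y_shifted)
print(aligned)  # 0.0 when the pairs match
print(shifted)  # 0.5, non-zero when they do not
```

If the "sanity" value printed during training is ever non-zero, the two Data layers have drifted apart and the x and y batches no longer correspond.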
Source: https://stackoverflow.com/questions/36447505/creating-large-lmdbs-for-caffe-with-numpy-arrays