Question
This issue seems to have existed for a long time, and many users are running into it.
stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 64 spatial: 7 264 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
The message is so cryptic that I cannot tell what went wrong in my code. The same code works fine with CPU TensorFlow.
I have heard that tf.cond can be used to work around this, but I am new to tensorflow-gpu. Can someone please help? My code uses Keras and takes its input from a generator in order to avoid out-of-memory issues. The generator is a while True loop that yields data one batch at a time.
def resnet_model(bin_multiple):
    # input and reshape
    inputs = Input(shape=input_shape)
    reshape = Reshape(input_shape_channels)(inputs)

    # normal convnet layer (have to do one initially to get 64 channels)
    conv = Conv2D(64, (1, bin_multiple * note_range), padding="same", activation='relu')(reshape)
    pool = MaxPooling2D(pool_size=(1, 2))(conv)

    for i in range(int(np.log2(bin_multiple)) - 1):
        print(i)
        # residual block
        bn = BatchNormalization()(pool)
        re = Activation('relu')(bn)
        freq_range = int((bin_multiple / (2 ** (i + 1))) * note_range)
        print(freq_range)
        conv = Conv2D(64, (1, freq_range), padding="same", activation='relu')(re)

        # add and downsample
        ad = add([pool, conv])
        pool = MaxPooling2D(pool_size=(1, 2))(ad)

    flattened = Flatten()(pool)
    fc = Dense(1024, activation='relu')(flattened)
    do = Dropout(0.5)(fc)
    fc = Dense(512, activation='relu')(do)
    do = Dropout(0.5)(fc)
    outputs = Dense(note_range, activation='sigmoid')(do)

    model = Model(inputs=inputs, outputs=outputs)
    return model

model = resnet_model(bin_multiple)
init_lr = float(args['init_lr'])
model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=init_lr, momentum=0.9),
              metrics=['accuracy', 'mae', 'categorical_accuracy'])
model.summary()
history = model.fit_generator(trainGen.next(), trainGen.steps(), epochs=epochs,
                              verbose=1, validation_data=valGen.next(),
                              validation_steps=valGen.steps(), callbacks=callbacks,
                              workers=8, use_multiprocessing=True)
Answer 1:
The problem occurs when your model receives a batch of size 0. In my case I had 1000 examples and trained on 2 GPUs with a batch size of 32. In my graph I split each batch into mini-batches so that each GPU takes 16 examples. After 31 steps (31 * 32) I have processed 992 examples, so only 8 examples are left. Those 8 go to GPU 1, and GPU 2 ends up with a batch of size zero, which is why I got the same error as you.
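To make the arithmetic above concrete, here is a small pure-Python sketch that reproduces the split. The helper name split_batches is made up for illustration, and it assumes each batch is cut into fixed slices of batch_size // num_gpus examples, which is how I understand my own setup. It does not touch TensorFlow at all.

```python
def split_batches(num_examples, batch_size, num_gpus):
    """Yield (step, per-GPU sizes) when each batch is cut into fixed-size slices.

    Hypothetical helper: it only models the bookkeeping, not the training graph.
    """
    per_gpu = batch_size // num_gpus
    step = 0
    for start in range(0, num_examples, batch_size):
        # The last batch may hold fewer than batch_size examples.
        remaining = min(batch_size, num_examples - start)
        sizes = []
        for _ in range(num_gpus):
            take = min(per_gpu, remaining)
            sizes.append(take)
            remaining -= take
        yield step, sizes
        step += 1


splits = list(split_batches(num_examples=1000, batch_size=32, num_gpus=2))
last_step, last_sizes = splits[-1]
print(last_step, last_sizes)  # 31 [8, 0]
```

Steps are counted from 0 here, so the first 31 steps (0 to 30) are full, and step 31 hands 8 examples to the first GPU and none to the second.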
I still haven't solved it and am still looking for a proper solution. I hope this helps you find where in your code a zero-size batch is produced.
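One workaround that might be worth trying, though I have not verified it against either setup, is to make sure the generator never emits a partial batch. The sketch below assumes a while True generator like the one described in the question. The names full_batches and steps_per_epoch are hypothetical, and in real use x and y would be NumPy arrays rather than lists.

```python
def full_batches(x, y, batch_size):
    """Loop forever over (x, y), yielding only complete batches.

    Hypothetical helper: the trailing partial batch of each pass is dropped,
    so every yielded batch has exactly batch_size examples.
    """
    # This check runs on the first next() call, because this is a generator.
    if len(x) < batch_size:
        raise ValueError("need at least one full batch")
    steps = len(x) // batch_size
    while True:
        for i in range(steps):
            lo = i * batch_size
            yield x[lo:lo + batch_size], y[lo:lo + batch_size]


def steps_per_epoch(num_examples, batch_size):
    """Number of complete batches per pass over the data."""
    return num_examples // batch_size


# Demo with plain lists standing in for arrays.
x = list(range(1000))
y = list(range(1000))
gen = full_batches(x, y, batch_size=32)
sizes = [len(next(gen)[0]) for _ in range(2 * steps_per_epoch(1000, 32))]
print(set(sizes))  # {32}
```

The cost is that the last 8 of the 1000 examples are skipped on every pass unless the data is reshuffled between passes. If the batch is also split across GPUs, the batch size should be a multiple of the number of GPUs so that every slice is non-empty.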
Source: https://stackoverflow.com/questions/47566281/tensorflow-gpu-crashes-for-0-batch-size-cudnn-status-bad-param