Question
I was trying to use TFRecords to train the PNet of MTCNN. The loss decreased smoothly for the first few epochs, then it became 'nan' and so did the model weights.
Below are my model structure and training results:
from tensorflow.keras.layers import (Input, Conv2D, PReLU, MaxPooling2D,
                                     Reshape, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import glorot_normal
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam

def pnet_train1(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')
    M = Conv2D(10, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!
    M = Conv2D(16, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu2')(M)
    M = Conv2D(32, 3, strides = 1, padding = 'valid', kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu3')(M)
    # Output heads: 1x1 convs on the 1x1 feature map produced by a 12x12 input.
    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)
    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)
    return model
model = pnet_train1(True)
model.compile(optimizer = Adam(lr = 0.001), loss = custom_loss)
model.fit(ds, steps_per_epoch = 1636, epochs = 100, validation_data = ds, validation_steps = 1636)
[image: pnet training records]
I know there are several possible causes, so I have already tried the following checks:
- Check the data set to see if there is bad data:
My data set is:
X: images of shape (12, 12, 3);
Y: label, bbox regression coords & 6-landmark regression coords concatenated together, of shape (17, ).
The label can be 1, -1, 0 or -2, and only labels 1 and 0 participate in the custom loss I wrote myself.
The roi & landmark coords all lie in [-1, 1].
The image data is normalized as (x - 127.5) / 128. before being fed into the training pipeline.
To verify whether the data causes the 'nan' loss, I extracted one batch (1792 samples) from the dataset as numpy arrays ((1792, 12, 12, 3), (1792, 17)). Training on just this one batch still reproduces the problem (a minimal finiteness check for such a batch is sketched at the end of this item). At the epoch right before the loss becomes 'nan', the loss looks normal and all the model weights lie in (-1, 1), i.e. they are all very small values:
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 74us/sample - loss: 0.1579
Epoch 2/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1574
Epoch 3/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1567
Epoch 4/10
1792/1792 [==============================] - 0s 65us/sample - loss: 0.1550
Epoch 5/10
1792/1792 [==============================] - 0s 61us/sample - loss: 0.1556
Epoch 6/10
1792/1792 [==============================] - 0s 70us/sample - loss: 0.1527
Epoch 7/10
1792/1792 [==============================] - 0s 71us/sample - loss: 0.1532
Epoch 8/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1509
Epoch 9/10
1792/1792 [==============================] - 0s 66us/sample - loss: 0.1501
Epoch 10/10
1792/1792 [==============================] - 0s 67us/sample - loss: 0.1495
Out[111]: <tensorflow.python.keras.callbacks.History at 0x1f767efa088>
temp_weights_list = []
for layer in model.layers:
    temp_layer = model.get_layer(layer.name)
    temp_weights = temp_layer.get_weights()
    temp_weights_list.append(temp_weights)
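# (Hypothetical sanity check, not part of my original code: confirm the
# extracted weights are all finite and small at this point.)
import numpy as np
print(all(np.isfinite(w).all() for ws in temp_weights_list for w in ws))  # expected: True
print(max(np.abs(w).max() for ws in temp_weights_list for w in ws))       # expected: below 1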
model.fit(x, y, batch_size = 896, epochs = 10)
Train on 1792 samples
Epoch 1/10
1792/1792 [==============================] - 0s 70us/sample - loss: nan
Epoch 2/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 3/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Epoch 4/10
1792/1792 [==============================] - 0s 61us/sample - loss: nan
Then I reloaded the model structure and loaded the normal weights saved before the loss became weird; the 'nan' occurred again.
Then I extracted another batch of samples from my dataset, and the same thing happened.
So for now I think my data set is fine (I use img_string = open(img_path, 'rb').read()
to generate the TFRecords, which raises an error if an image is damaged or an image path is invalid).
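For reference, this is the minimal finiteness check I mean for one extracted batch; x and y are the numpy arrays of shapes (1792, 12, 12, 3) and (1792, 17) mentioned above (just a sketch):
import numpy as np

# The batch itself should contain no NaN/Inf, the labels should only take the
# values {1, 0, -1, -2}, and the regression targets should lie in [-1, 1].
assert np.all(np.isfinite(x)), 'NaN/Inf found in the image batch'
assert np.all(np.isfinite(y)), 'NaN/Inf found in the targets'
print(np.unique(y[:, 0]))              # labels
print(y[:, 1:].min(), y[:, 1:].max())  # bbox & landmark targets
print(x.min(), x.max())                # normalized pixels, roughly in [-1, 1)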
- Since this is usually a gradient explosion problem, I tried:
adding a BatchNormalization layer,
adding L2 regularization to the weights,
using Xavier initialization,
and picking a smaller learning rate. Here is my new model structure:
from tensorflow.keras.layers import BatchNormalization

def pnet_train2(train_with_landmark = False):
    X = Input(shape = (12, 12, 3), name = 'Pnet_input')
    M = Conv2D(10, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv1')(X)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn1')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu1')(M)
    M = MaxPooling2D(pool_size = 2, name = 'Pnet_maxpool1')(M)  # default 'pool_size' is 2!!!
    M = Conv2D(16, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv2')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn2')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu2')(M)
    M = Conv2D(32, 3, strides = 1, padding = 'valid', use_bias = False, kernel_initializer = glorot_normal, kernel_regularizer = l2(0.00001), name = 'Pnet_conv3')(M)
    M = BatchNormalization(axis = -1, name = 'Pnet_bn3')(M)
    M = PReLU(shared_axes = [1, 2], name = 'Pnet_prelu3')(M)
    Classifier_conv = Conv2D(1, 1, activation = 'sigmoid', name = 'Pnet_classifier_conv', kernel_initializer = glorot_normal)(M)
    Bbox_regressor_conv = Conv2D(4, 1, name = 'Pnet_bbox_regressor_conv', kernel_initializer = glorot_normal)(M)
    Landmark_regressor_conv = Conv2D(12, 1, name = 'Pnet_landmark_regressor_conv', kernel_initializer = glorot_normal)(M)
    Classifier = Reshape((1, ), name = 'Pnet_classifier')(Classifier_conv)
    Bbox_regressor = Reshape((4, ), name = 'Pnet_bbox_regressor')(Bbox_regressor_conv)
    if train_with_landmark:
        Landmark_regressor = Reshape((12, ), name = 'Pnet_landmark_regressor')(Landmark_regressor_conv)
        Pnet_output = Concatenate()([Classifier, Bbox_regressor, Landmark_regressor])
        model = Model(X, Pnet_output)
    else:
        Pnet_output = Concatenate()([Classifier, Bbox_regressor])
        model = Model(X, Pnet_output)
    return model
I also changed the initial learning rate from 0.001 (the Keras default for Adam) to 0.0001. The same thing happened every time; the process was just slower because of the smaller learning rate...
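For completeness, this is how I compile with the smaller learning rate; the clipnorm argument is not something I have tried yet and is shown only as a possible extra safeguard against exploding gradients (a sketch, not my actual setup):
model = pnet_train2(True)
# lr lowered from the Keras default 0.001 to 0.0001; clipnorm = 1.0 is
# hypothetical and only illustrates clipping the global gradient norm.
model.compile(optimizer = Adam(lr = 0.0001, clipnorm = 1.0), loss = custom_loss)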
- The last thing that came to mind is the custom loss I wrote myself. It is specialized for the MTCNN multi-task problem. Since the output of my model has shape (17, ) / (batch_size, 17), I replaced the loss with 'mse' as a check. It may not give a very good regression result, but it should at least work. The same thing happened -- after the loss decreased for a few epochs, the loss and the model weights became 'nan'...
Thinking about it further, float32 can represent values up to roughly ±3.4e38, and the weights before the failure are all around 10^-2, so a plain overflow seems unlikely. My guess is that some denominator in the loss becomes ~0 and causes this problem. But 'mse' involves no division, so I am still confused about the real cause.
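For reference, here is a minimal sketch of the kind of masked multi-task loss I mean (not my exact implementation: which samples feed the bbox/landmark terms follows the usual MTCNN convention, the 0.5 task weights are just an example, and the tf.maximum guards show the kind of near-zero-denominator protection mentioned above):
import tensorflow as tf

def masked_mtcnn_loss_sketch(y_true, y_pred):
    # y_true / y_pred: (batch, 17) = [label, 4 bbox offsets, 12 landmark coords].
    label = y_true[:, 0]
    # Face classification: only negatives (0) and positives (1) contribute.
    cls_mask = tf.cast(tf.logical_or(tf.equal(label, 0.), tf.equal(label, 1.)), tf.float32)
    cls_true = tf.clip_by_value(label, 0., 1.)
    cls_pred = tf.clip_by_value(y_pred[:, 0], 1e-7, 1. - 1e-7)  # avoid log(0)
    cls_loss = -(cls_true * tf.math.log(cls_pred) + (1. - cls_true) * tf.math.log(1. - cls_pred))
    cls_loss = tf.reduce_sum(cls_loss * cls_mask) / tf.maximum(tf.reduce_sum(cls_mask), 1.)
    # Bbox regression: positives (1) and part faces (-1).
    bbox_mask = tf.cast(tf.logical_or(tf.equal(label, 1.), tf.equal(label, -1.)), tf.float32)
    bbox_loss = tf.reduce_sum(tf.square(y_true[:, 1:5] - y_pred[:, 1:5]), axis = -1)
    bbox_loss = tf.reduce_sum(bbox_loss * bbox_mask) / tf.maximum(tf.reduce_sum(bbox_mask), 1.)
    # Landmark regression: landmark samples (-2).
    lmk_mask = tf.cast(tf.equal(label, -2.), tf.float32)
    lmk_loss = tf.reduce_sum(tf.square(y_true[:, 5:] - y_pred[:, 5:]), axis = -1)
    lmk_loss = tf.reduce_sum(lmk_loss * lmk_mask) / tf.maximum(tf.reduce_sum(lmk_mask), 1.)
    return cls_loss + 0.5 * bbox_loss + 0.5 * lmk_loss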
I am really not sure whether this is a bug or a mistake I made, so any advice is appreciated. Thanks in advance.
Updated on 2020.04.10 01:57:
I tried to find the epoch where the loss goes wrong. The last epoch that still has a normal loss (0.0814) updates the model weights to 'nan'. So I think the problem occurs during back propagation (normal loss (0.0814) -> back prop -> model weights become 'nan' -> next loss becomes 'nan'). I used the following code to get the model weights:
model.fit(x, y, batch_size = 1792, epochs = 1)
Train on 1792 samples
1792/1792 [==============================] - 0s 223us/sample - loss: 0.0814
Out[36]: <tensorflow.python.keras.callbacks.History at 0x205ff7b0188>
temp_weights_list1 = []
for layer in model.layers:
    temp_layer = model.get_layer(layer.name)
    temp_weights = temp_layer.get_weights()
    temp_weights_list1.append(temp_weights)
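To pinpoint the exact batch where things blow up, I could also attach a small callback like the sketch below (NanWeightMonitor is just a hypothetical helper, not something built into Keras; checking every weight on each batch is slow but fine for debugging):
import numpy as np
import tensorflow as tf

class NanWeightMonitor(tf.keras.callbacks.Callback):
    # Stop training as soon as the reported loss or any weight becomes non-finite.
    def on_train_batch_end(self, batch, logs = None):
        logs = logs or {}
        loss = logs.get('loss')
        if loss is not None and not np.isfinite(loss):
            print('non-finite loss at batch', batch)
            self.model.stop_training = True
        elif any(not np.all(np.isfinite(w)) for w in self.model.get_weights()):
            print('non-finite weights at batch', batch)
            self.model.stop_training = True

model.fit(x, y, batch_size = 896, epochs = 10, callbacks = [NanWeightMonitor()])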
Source: https://stackoverflow.com/questions/61116002/tensorflow2-tf-keras-loss-and-model-weights-suddenly-become-nan-when-training