Question
I use transfer learning with ResNet50. I create a new model out of the pretrained model provided by Keras (with the 'imagenet' weights).
After training my new model, I save it as follows:
# Save the Siamese Network architecture
siamese_model_json = siamese_network.to_json()
with open("saved_model/siamese_network_arch.json", "w") as json_file:
    json_file.write(siamese_model_json)
# save the Siamese Network model weights
siamese_network.save_weights('saved_model/siamese_model_weights.h5')
And later, I reload it as follows to make some predictions:
json_file = open('saved_model/siamese_network_arch.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
siamese_network = model_from_json(loaded_model_json)
# load weights into new model
siamese_network.load_weights('saved_model/siamese_model_weights.h5')
Then I check whether the weights look reasonable, as follows (for one of the layers):
print("bn3d_branch2c:\n",
siamese_network.get_layer('model_1').get_layer('bn3d_branch2c').get_weights())
If I train my network for only 1 epoch, I see reasonable values there.
But if I train my model for 18 epochs (which takes 5-6 hours, as I have a very slow computer), I see only NaN values, like this:
bn3d_branch2c:
[array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
...
What is the trick here?
ADDENDUM 1:
Here is how I create my model.
Here, I have a triplet_loss function that I will need later on.
def triplet_loss(inputs, dist='euclidean', margin='maxplus'):
    anchor, positive, negative = inputs
    positive_distance = K.square(anchor - positive)
    negative_distance = K.square(anchor - negative)
    if dist == 'euclidean':
        positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims=True))
        negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims=True))
    elif dist == 'sqeuclidean':
        positive_distance = K.sum(positive_distance, axis=-1, keepdims=True)
        negative_distance = K.sum(negative_distance, axis=-1, keepdims=True)
    loss = positive_distance - negative_distance
    if margin == 'maxplus':
        loss = K.maximum(0.0, 2 + loss)
    elif margin == 'softplus':
        loss = K.log(1 + K.exp(loss))
    returned_loss = K.mean(loss)
    return returned_loss
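As a quick sanity check of the same math, here is the computation on plain NumPy arrays (a toy sketch, not part of the actual model):
import numpy as np

# Toy embeddings: the anchor is close to the positive, far from the negative
anchor   = np.array([[1.0, 0.0]])
positive = np.array([[0.9, 0.1]])
negative = np.array([[-1.0, 0.0]])

pos_dist = np.sqrt(np.sum(np.square(anchor - positive), axis=-1, keepdims=True))
neg_dist = np.sqrt(np.sum(np.square(anchor - negative), axis=-1, keepdims=True))

# 'maxplus' margin of 2, as in triplet_loss above
loss = np.maximum(0.0, 2 + pos_dist - neg_dist)
print(loss.mean())  # ~0.14: small, because the negative is much farther away than the positive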
And here is how I construct my model from start to end. I include the complete code to give the exact picture.
model = ResNet50(weights='imagenet')
# Remove the last layer (Needed to later be able to create the Siamese Network model)
model.layers.pop()
# First freeze all layers of ResNet50. Transfer Learning to be applied.
for layer in model.layers:
    layer.trainable = False
# All Batch Normalization layers still need to be trainable so that the "mean"
# and "standard deviation (std)" params can be updated with the new training data
model.get_layer('bn_conv1').trainable = True
model.get_layer('bn2a_branch2a').trainable = True
model.get_layer('bn2a_branch2b').trainable = True
model.get_layer('bn2a_branch2c').trainable = True
model.get_layer('bn2a_branch1').trainable = True
model.get_layer('bn2b_branch2a').trainable = True
model.get_layer('bn2b_branch2b').trainable = True
model.get_layer('bn2b_branch2c').trainable = True
model.get_layer('bn2c_branch2a').trainable = True
model.get_layer('bn2c_branch2b').trainable = True
model.get_layer('bn2c_branch2c').trainable = True
model.get_layer('bn3a_branch2a').trainable = True
model.get_layer('bn3a_branch2b').trainable = True
model.get_layer('bn3a_branch2c').trainable = True
model.get_layer('bn3a_branch1').trainable = True
model.get_layer('bn3b_branch2a').trainable = True
model.get_layer('bn3b_branch2b').trainable = True
model.get_layer('bn3b_branch2c').trainable = True
model.get_layer('bn3c_branch2a').trainable = True
model.get_layer('bn3c_branch2b').trainable = True
model.get_layer('bn3c_branch2c').trainable = True
model.get_layer('bn3d_branch2a').trainable = True
model.get_layer('bn3d_branch2b').trainable = True
model.get_layer('bn3d_branch2c').trainable = True
model.get_layer('bn4a_branch2a').trainable = True
model.get_layer('bn4a_branch2b').trainable = True
model.get_layer('bn4a_branch2c').trainable = True
model.get_layer('bn4a_branch1').trainable = True
model.get_layer('bn4b_branch2a').trainable = True
model.get_layer('bn4b_branch2b').trainable = True
model.get_layer('bn4b_branch2c').trainable = True
model.get_layer('bn4c_branch2a').trainable = True
model.get_layer('bn4c_branch2b').trainable = True
model.get_layer('bn4c_branch2c').trainable = True
model.get_layer('bn4d_branch2a').trainable = True
model.get_layer('bn4d_branch2b').trainable = True
model.get_layer('bn4d_branch2c').trainable = True
model.get_layer('bn4e_branch2a').trainable = True
model.get_layer('bn4e_branch2b').trainable = True
model.get_layer('bn4e_branch2c').trainable = True
model.get_layer('bn4f_branch2a').trainable = True
model.get_layer('bn4f_branch2b').trainable = True
model.get_layer('bn4f_branch2c').trainable = True
model.get_layer('bn5a_branch2a').trainable = True
model.get_layer('bn5a_branch2b').trainable = True
model.get_layer('bn5a_branch2c').trainable = True
model.get_layer('bn5a_branch1').trainable = True
model.get_layer('bn5b_branch2a').trainable = True
model.get_layer('bn5b_branch2b').trainable = True
model.get_layer('bn5b_branch2c').trainable = True
model.get_layer('bn5c_branch2a').trainable = True
model.get_layer('bn5c_branch2b').trainable = True
model.get_layer('bn5c_branch2c').trainable = True
# Used when compiling the siamese network
def identity_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)
# Create the siamese network
x = model.get_layer('flatten_1').output # layer 'flatten_1' is the last layer of the model
model_out = Dense(128, activation='relu', name='model_out')(x)
model_out = Lambda(lambda x: K.l2_normalize(x,axis=-1))(model_out)
new_model = Model(inputs=model.input, outputs=model_out)
anchor_input = Input(shape=(224, 224, 3), name='anchor_input')
pos_input = Input(shape=(224, 224, 3), name='pos_input')
neg_input = Input(shape=(224, 224, 3), name='neg_input')
encoding_anchor = new_model(anchor_input)
encoding_pos = new_model(pos_input)
encoding_neg = new_model(neg_input)
loss = Lambda(triplet_loss)([encoding_anchor, encoding_pos, encoding_neg])
siamese_network = Model(inputs=[anchor_input, pos_input, neg_input],
                        outputs=loss)
# Note that the output of the model is the return value from the triplet_loss function above
siamese_network.compile(optimizer=Adam(lr=.0001), loss=identity_loss)
One thing to notice is that I make all batch normalization layers "trainable" so that the BN-related parameters can be updated with my training data. This creates a lot of lines, but I could not find a shorter solution (a possible shorter alternative is sketched below).
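A shorter alternative I did not try (an untested sketch) would be to loop over the layers and set the flag based on the layer type instead of listing every layer name:
from keras.layers import BatchNormalization

# Freeze everything except the batch normalization layers, in one pass
for layer in model.layers:
    layer.trainable = isinstance(layer, BatchNormalization)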
Answer 1:
The solution was inspired by @Gurmeet Singh's recommendation above.
Apparently, the weights of the trainable layers became very large at some point during training and were then set to NaN. This made me think that I was saving and reloading my models the wrong way, but the real problem was exploding gradients.
I saw a similar issue in the GitHub discussions too, which can be checked out here: github.com/keras-team/keras/issues/2378. At the bottom of that thread, it is recommended to use lower learning rates to avoid the problem.
In this link (Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend), two solutions are discussed (see the short sketch below):
- using the clipvalue parameter of the optimizer, which simply clips each calculated gradient value to the configured range; this is not the recommended option (as explained in that thread),
- and using the clipnorm parameter, which clips the calculated gradients whenever their L2 norm exceeds the value given by the user.
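For illustration, both parameters are passed directly to the Keras optimizer as keyword arguments (a sketch with arbitrary values, not the settings I ended up using):
from keras.optimizers import Adam

# Option 1: clip every gradient element to the range [-0.5, 0.5] (not the recommended option)
optimizer_clipvalue = Adam(lr=0.0001, clipvalue=0.5)

# Option 2: rescale each gradient tensor whenever its L2 norm exceeds 1.0
optimizer_clipnorm = Adam(lr=0.0001, clipnorm=1.0)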
I also thought about using input normalization (to avoid exploding gradients) but then figured out that it is already done in the preprocess_input(..) function. (Check this link for details: https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/preprocess_input) It is also possible to set the mode parameter to "tf" (it is "caffe" by default), which could further help because mode="tf" scales pixels to the range between -1 and 1, but I did not try it.
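For reference, this is roughly what the two modes do to an input batch. The sketch below assumes a Keras version where the generic imagenet_utils.preprocess_input exposes the mode argument (the resnet50-specific wrapper may not accept it):
import numpy as np
from keras.applications.imagenet_utils import preprocess_input

x = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype('float32')
x_caffe = preprocess_input(x.copy(), mode='caffe')  # RGB->BGR + ImageNet mean subtraction (default)
x_tf = preprocess_input(x.copy(), mode='tf')        # scales pixels to the range [-1, 1]
print(x_tf.min(), x_tf.max())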
In summary, I changed two things when compiling the model to be trained.
The line that has been changed is the following:
Before the change:
siamese_network.compile(optimizer=Adam(lr=.0001),
                        loss=identity_loss)
After the change:
siamese_network.compile(optimizer=Adam(lr=.00004, clipnorm=1.),
                        loss=identity_loss)
1) I used a smaller learning rate to make the gradient updates a bit smaller. 2) I used the clipnorm parameter to clip the calculated gradients when their norm gets too large.
And I trained my network again for 10 epochs. The loss decreases as desired, just more slowly now. And I do not experience any problems when saving and reloading my model (at least after 10 epochs; it takes time on my computer).
Note that I set the value of clipnorm to 1. This means that the L2 norm of the gradients is calculated first, and if it exceeds 1, the gradient is rescaled so that its norm equals 1. I assume this is a hyperparameter that can be tuned; it affects the time needed to train the model while helping to avoid the exploding gradients problem.
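A small NumPy illustration of that arithmetic (not the Keras internals, just what happens to one gradient tensor whose norm exceeds the threshold):
import numpy as np

grad = np.array([3.0, 4.0])   # L2 norm = 5.0
clipnorm = 1.0

norm = np.linalg.norm(grad)
clipped = grad * clipnorm / norm if norm > clipnorm else grad
print(clipped)                # [0.6 0.8], i.e. the same direction rescaled to norm 1.0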
Source: https://stackoverflow.com/questions/51292212/keras-model-params-are-all-nans-after-reloading