Deep-Learning NaN loss reasons

执念已碎 2020-11-28 02:12

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifics:

I am using Tensorflow's iris_tra

9 Answers
  • 2020-11-28 02:25

    Regularization can help. For a classifier, there is a good case for activity regularization, whether it is binary or multi-class. For a regressor, kernel regularization might be more appropriate.
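
    A minimal Keras sketch of both options (layer sizes and penalty strengths are illustrative): activity_regularizer penalizes the layer's outputs, kernel_regularizer penalizes its weights.

    from tensorflow.keras import Sequential, layers, regularizers

    # Classifier: penalize activations (activity regularization)
    classifier = Sequential([
        layers.Dense(64, activation='relu',
                     activity_regularizer=regularizers.l2(1e-4)),
        layers.Dense(3, activation='softmax'),
    ])
    classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # Regressor: penalize weights (kernel regularization)
    regressor = Sequential([
        layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1),
    ])
    regressor.compile(optimizer='adam', loss='mse')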

  • 2020-11-28 02:26

    If you're training with cross entropy, you want to add a small number like 1e-8 to your output probability.

    Because log(0) is negative infinity, once your model is trained enough the output distribution will be very skewed. For instance, say I'm doing a 4-class output; in the beginning my probabilities look like

    0.25 0.25 0.25 0.25
    

    but toward the end the probability will probably look like

    1.0 0 0 0
    

    If you take the cross entropy of this distribution, everything will explode. The fix is to artificially add a small number to all the terms to prevent this.
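
    A minimal NumPy sketch (function name and epsilon value are illustrative) of clipping the probabilities before taking the log:

    import numpy as np

    def safe_cross_entropy(y_true, y_pred, eps=1e-8):
        # Clip probabilities away from exactly 0 so log() never returns -inf
        y_pred = np.clip(y_pred, eps, 1.0)
        return -np.sum(y_true * np.log(y_pred), axis=-1)

    y_true = np.array([0., 1., 0., 0.])   # one-hot target
    y_pred = np.array([1., 0., 0., 0.])   # overconfident, wrong prediction
    print(safe_cross_entropy(y_true, y_pred))   # large but finite (~18.4), not inf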

  • 2020-11-28 02:30

    If using integers as targets, make sure they aren't symmetric around 0.

    I.e., don't use classes -1, 0, 1. Use 0, 1, 2 instead.
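
    For example, a small sketch (label values taken from above) that shifts or re-indexes the labels so they start at 0:

    import numpy as np

    labels = np.array([-1, 0, 1, 1, -1])
    labels = labels + 1                      # -1, 0, 1  ->  0, 1, 2
    # For arbitrary label sets, re-index instead:
    # _, labels = np.unique(labels, return_inverse=True)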

  • 2020-11-28 02:30

    The reason for nan, inf, or -inf often comes from the fact that division by 0.0 in TensorFlow doesn't raise a division-by-zero exception; it produces a nan, inf, or -inf "value" instead. Your training data might contain 0.0, so your loss function may end up performing a division by 0.0.

    import tensorflow as tf

    # Division by zero yields inf/-inf (and 0/0 yields nan) instead of raising an exception
    a = tf.constant([2., 0., -2.])
    b = tf.constant([0., 0., 0.])
    c = tf.constant([1., 1., 1.])
    print((a / b) + c)
    

    Output is the following tensor:

    tf.Tensor([ inf  nan -inf], shape=(3,), dtype=float32)
    

    Adding a small epsilon (e.g., 1e-5) to the denominator often does the trick. Additionally, since TensorFlow 2 the operation tf.math.divide_no_nan is available.
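
    Building on the tensors above, a small sketch of both fixes (the epsilon value is illustrative):

    eps = 1e-5
    print((a / (b + eps)) + c)               # finite values instead of inf/nan
    print(tf.math.divide_no_nan(a, b) + c)   # yields 0 where the denominator is 0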

  • 2020-11-28 02:32

    In my case I got NaN when using distant integer labels, i.e.:

    • With labels in [0..100], the training was OK,
    • with labels in [0..100] plus one additional label 8000, I got NaNs.

    So, don't use a very distant label.

    EDIT: You can see the effect in the following simple code:

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    import numpy as np

    X = np.random.random(size=(20, 5))
    y = np.random.randint(0, high=5, size=(20, 1))   # labels in 0..4

    model = Sequential([
        Dense(10, input_dim=X.shape[1]),
        Activation('relu'),
        Dense(5),
        Activation('softmax'),
    ])
    model.compile(optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    print('fit model with labels in range 0..5')
    history = model.fit(X, y, epochs=5)

    # Append one sample with a far-out-of-range label (8000) -> the loss becomes nan
    X = np.vstack((X, np.random.random(size=(1, 5))))
    y = np.vstack((y, [[8000]]))
    print('fit model with labels in range 0..5 plus 8000')
    history = model.fit(X, y, epochs=5)
    

    The result shows the NaNs after adding the label 8000:

    fit model with labels in range 0..5
    Epoch 1/5
    20/20 [==============================] - 0s 25ms/step - loss: 1.8345 - acc: 0.1500
    Epoch 2/5
    20/20 [==============================] - 0s 150us/step - loss: 1.8312 - acc: 0.1500
    Epoch 3/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8273 - acc: 0.1500
    Epoch 4/5
    20/20 [==============================] - 0s 198us/step - loss: 1.8233 - acc: 0.1500
    Epoch 5/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8192 - acc: 0.1500
    fit model with labels in range 0..5 plus 8000
    Epoch 1/5
    21/21 [==============================] - 0s 142us/step - loss: nan - acc: 0.1429
    Epoch 2/5
    21/21 [==============================] - 0s 238us/step - loss: nan - acc: 0.2381
    Epoch 3/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 4/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 5/5
    21/21 [==============================] - 0s 188us/step - loss: nan - acc: 0.2381
    
  • 2020-11-28 02:34

    I'd like to add some (shallow) reasons I have experienced:

    1. we may have updated our dictionary (for NLP tasks), but the model and the prepared data used a different one.
    2. we may have reprocessed our data (binary tf_record) but loaded the old model; the reprocessed data may conflict with the previous one.
    3. we may need to train the model from scratch but forgot to delete the checkpoints, so the model automatically loaded the latest parameters.

    Hope that helps.
