Deep-Learning NaN loss reasons

执念已碎 2020-11-28 02:12

Perhaps too general a question, but can anyone explain what would cause a Convolutional Neural Network to diverge?

Specifics:

I am using Tensorflow's iris_tra

9 Answers
  • 2020-11-28 02:25

    Regularization can help. For a classifier, there is a good case for activity regularization, whether it is binary or multi-class. For a regressor, kernel regularization might be more appropriate.
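
    A minimal Keras sketch of both options (layer sizes and penalty strengths are illustrative): activity_regularizer penalizes the layer's outputs, kernel_regularizer penalizes its weights.

    from tensorflow.keras import Sequential, layers, regularizers

    # Classifier: penalize activations (activity regularization)
    classifier = Sequential([
        layers.Dense(64, activation='relu',
                     activity_regularizer=regularizers.l2(1e-4)),
        layers.Dense(3, activation='softmax'),
    ])
    classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # Regressor: penalize weights (kernel regularization)
    regressor = Sequential([
        layers.Dense(64, activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4)),
        layers.Dense(1),
    ])
    regressor.compile(optimizer='adam', loss='mse')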

  • 2020-11-28 02:26

    If you're training with cross entropy, you want to add a small number like 1e-8 to your output probability.

    Because log(0) is negative infinity, once your model is trained enough the output distribution will be very skewed. For instance, say I'm doing a 4-class output; in the beginning my probabilities look like

    0.25 0.25 0.25 0.25
    

    but toward the end the probability will probably look like

    1.0 0 0 0
    

    If you take the cross entropy of this distribution, everything will explode. The fix is to artificially add a small number to all the terms to prevent this.
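
    A minimal NumPy sketch (function name and epsilon value are illustrative) of clipping the probabilities before taking the log:

    import numpy as np

    def safe_cross_entropy(y_true, y_pred, eps=1e-8):
        # Clip probabilities away from exactly 0 so log() never returns -inf
        y_pred = np.clip(y_pred, eps, 1.0)
        return -np.sum(y_true * np.log(y_pred), axis=-1)

    y_true = np.array([0., 1., 0., 0.])   # one-hot target
    y_pred = np.array([1., 0., 0., 0.])   # overconfident, wrong prediction
    print(safe_cross_entropy(y_true, y_pred))   # large but finite (~18.4), not inf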

  • 2020-11-28 02:30

    If using integers as targets, make sure they aren't symmetric around 0.

    I.e., don't use classes -1, 0, 1. Use 0, 1, 2 instead.
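
    For example, a small sketch (label values taken from above) that shifts or re-indexes the labels so they start at 0:

    import numpy as np

    labels = np.array([-1, 0, 1, 1, -1])
    labels = labels + 1                      # -1, 0, 1  ->  0, 1, 2
    # For arbitrary label sets, re-index instead:
    # _, labels = np.unique(labels, return_inverse=True)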

  • 2020-11-28 02:30

    The reason for nan, inf, or -inf often comes from the fact that division by 0.0 in TensorFlow doesn't raise a division-by-zero exception; it produces a nan, inf, or -inf "value" instead. Your training data might contain 0.0, so your loss function may end up performing a division by 0.0.

    import tensorflow as tf

    # Division by zero yields inf/-inf (and 0/0 yields nan) instead of raising an exception
    a = tf.constant([2., 0., -2.])
    b = tf.constant([0., 0., 0.])
    c = tf.constant([1., 1., 1.])
    print((a / b) + c)
    

    Output is the following tensor:

    tf.Tensor([ inf  nan -inf], shape=(3,), dtype=float32)
    

    Adding a small epsilon (e.g., 1e-5) to the denominator often does the trick. Additionally, since TensorFlow 2 the operation tf.math.divide_no_nan is available.
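
    Building on the tensors above, a small sketch of both fixes (the epsilon value is illustrative):

    eps = 1e-5
    print((a / (b + eps)) + c)               # finite values instead of inf/nan
    print(tf.math.divide_no_nan(a, b) + c)   # yields 0 where the denominator is 0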

  • 2020-11-28 02:32

    In my case I got NaN when using distant integer labels, i.e.:

    • With labels in [0..100], the training was OK,
    • with labels in [0..100] plus one additional label 8000, I got NaNs.

    So, don't use a very distant label.

    EDIT: You can see the effect in the following simple code:

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    import numpy as np

    X = np.random.random(size=(20, 5))
    y = np.random.randint(0, high=5, size=(20, 1))   # labels in 0..4

    model = Sequential([
        Dense(10, input_dim=X.shape[1]),
        Activation('relu'),
        Dense(5),
        Activation('softmax'),
    ])
    model.compile(optimizer="Adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    print('fit model with labels in range 0..5')
    history = model.fit(X, y, epochs=5)

    # Append one sample with a far-out-of-range label (8000) -> the loss becomes nan
    X = np.vstack((X, np.random.random(size=(1, 5))))
    y = np.vstack((y, [[8000]]))
    print('fit model with labels in range 0..5 plus 8000')
    history = model.fit(X, y, epochs=5)
    

    The result shows the NaNs after adding the label 8000:

    fit model with labels in range 0..5
    Epoch 1/5
    20/20 [==============================] - 0s 25ms/step - loss: 1.8345 - acc: 0.1500
    Epoch 2/5
    20/20 [==============================] - 0s 150us/step - loss: 1.8312 - acc: 0.1500
    Epoch 3/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8273 - acc: 0.1500
    Epoch 4/5
    20/20 [==============================] - 0s 198us/step - loss: 1.8233 - acc: 0.1500
    Epoch 5/5
    20/20 [==============================] - 0s 151us/step - loss: 1.8192 - acc: 0.1500
    fit model with labels in range 0..5 plus 8000
    Epoch 1/5
    21/21 [==============================] - 0s 142us/step - loss: nan - acc: 0.1429
    Epoch 2/5
    21/21 [==============================] - 0s 238us/step - loss: nan - acc: 0.2381
    Epoch 3/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 4/5
    21/21 [==============================] - 0s 191us/step - loss: nan - acc: 0.2381
    Epoch 5/5
    21/21 [==============================] - 0s 188us/step - loss: nan - acc: 0.2381
    
  • 2020-11-28 02:34

    I'd like to add some (shallow) reasons I have experienced:

    1. we may have updated our dictionary (for NLP tasks), but the model and the prepared data used a different one.
    2. we may have reprocessed our data (binary tf_record) but loaded the old model; the reprocessed data may conflict with the previous one.
    3. we may need to train the model from scratch but forgot to delete the checkpoints, so the model automatically loaded the latest parameters.

    Hope that helps.
