Strange behaviour of the loss function in keras model, with pretrained convolutional base

落爺英雄遲暮 submitted on 2019-11-27 16:11:30

Looks like I found the solution. As I suggested, the problem is with the BatchNormalization layers. They do three things: 1) subtract the mean and normalize by the std, 2) collect statistics on the mean and std using a running average, and 3) train two additional parameters (two per node). When one sets trainable to False, these two parameters freeze and the layer also stops collecting statistics on the mean and std. But it looks like the layer still performs normalization during training time using the statistics of the training batch. Most likely it's a bug in Keras, or maybe they did it on purpose for some reason. As a result, the forward-propagation calculations during training time differ from those at prediction time, even though the trainable attribute is set to False.
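To make the mismatch concrete, here is a minimal sketch in plain Python (no Keras) of the two forward passes described above. The stored statistics, the batch values and the EPS constant are made-up assumptions for illustration, not values taken from any real pretrained model.

```python
import math
import statistics

EPS = 1e-3  # small stability constant (illustrative value, an assumption)

def batchnorm(batch, mean, var, gamma=1.0, beta=0.0):
    """Normalize each value with the given statistics, then scale and shift."""
    return [gamma * (v - mean) / math.sqrt(var + EPS) + beta for v in batch]

# Hypothetical running statistics stored in a pretrained layer.
pretrained_mean, pretrained_var = 0.0, 1.0

# A batch from a new dataset whose distribution differs from the pretraining data.
batch = [4.0, 5.0, 6.0, 7.0]

# Training-mode forward pass: statistics come from the current batch.
train_out = batchnorm(batch, statistics.fmean(batch), statistics.pvariance(batch))

# Inference-mode forward pass: statistics come from the stored running averages.
infer_out = batchnorm(batch, pretrained_mean, pretrained_var)

print("training mode :", [round(v, 3) for v in train_out])
print("inference mode:", [round(v, 3) for v in infer_out])
```

The same input gives outputs centred on zero in training mode but far from zero in inference mode, so every layer after the BatchNormalization layer sees different activations in the two modes.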

There are two possible solutions I can think of:

  1. Set all BatchNormalization layers to trainable. In this case these layers will collect statistics from your dataset instead of using the pretrained ones (which can be significantly different!), so all the BatchNorm layers are adjusted to your custom dataset during training.
  2. Split the model into two parts, model=model_base+model_top. Then use model_base to extract features with model_base.predict(), feed these features into model_top, and train only model_top.
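The second option could look roughly like the sketch below. The tiny Dense + BatchNormalization base is only a hypothetical stand-in for a real pretrained convolutional base, and the shapes and layer sizes are made up for illustration.

```python
import numpy as np
from keras import layers, models

x = np.random.rand(32, 100).astype("float32")
y = np.random.rand(32, 1).astype("float32")

# Hypothetical stand-in for the pretrained base (contains a BatchNormalization layer).
base_in = layers.Input(shape=(100,))
h = layers.Dense(16, activation="relu")(base_in)
h = layers.BatchNormalization()(h)
model_base = models.Model(base_in, h)

# Small trainable head that only ever sees the extracted features.
top_in = layers.Input(shape=(16,))
top_out = layers.Dense(1)(top_in)
model_top = models.Model(top_in, top_out)
model_top.compile(loss="mse", optimizer="adam")

# predict() runs the base in inference mode, so BatchNormalization uses its
# stored moving statistics here, exactly as it will at prediction time.
features = model_base.predict(x, verbose=0)

# Train only the head on the cached features, then evaluate on the same features.
model_top.fit(features, y, batch_size=32, epochs=1, verbose=0)
eval_loss = model_top.evaluate(features, y, verbose=0)
print(features.shape, eval_loss)
```

Because the base is never run in training mode, the features fed to model_top during training are the same ones it would receive at prediction time.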

I've just tried the first solution and it looks like it's working:

model.fit(x=dat[0],y=dat[1],batch_size=32)

Epoch 1/1
32/32 [==============================] - 1s 28ms/step - loss: 3.1053

model.evaluate(x=dat[0],y=dat[1])

32/32 [==============================] - 0s 10ms/step
2.487905502319336

This was after some training: one needs to wait until enough statistics on the mean and std have been collected.

I haven't tried the second solution yet, but I'm pretty sure it will work, since forward propagation during training and prediction will be the same.

Update: I found a great blog post where this issue is discussed in detail. Check it out here

But dropout layers usually create the opposite effect, making the loss on evaluation lower than the loss during training.

Not necessarily! Although some of the neurons are dropped in a dropout layer, bear in mind that the output is scaled back according to the dropout rate. At inference time (i.e. test time) dropout is removed entirely, and considering that you have trained your model for just one epoch, the behavior you saw may happen. Don't forget that since you are training the model for just one epoch, only a portion of the neurons have been dropped in the dropout layer, but all of them are present at inference time.
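The rescaling mentioned above can be sketched in plain Python as follows. This is a minimal illustration of "inverted" dropout under my own assumptions (a hypothetical dropout helper, all activations equal to 1.0, rate 0.5), not Keras's actual implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: during training, drop each unit with probability
    `rate` and rescale the survivors; at inference time, do nothing."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 100_000   # made-up activations, all equal to 1.0

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

# The average activation stays close to 1.0 in both modes, although
# individual units are either zeroed or doubled during training.
print(sum(train_out) / len(train_out))
print(sum(test_out) / len(test_out))
```

On average the layer output is preserved, but any single training pass still sees a noisier version of the network than the one used at inference time.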

If you continue training the model for more epochs, you might expect the training loss and the test loss (on the same data) to become more or less the same.

Try it yourself: remove the Dropout layer(s), or set their rate to 0, and see whether this still happens. (Setting trainable to False on a Dropout layer changes nothing, since the layer has no weights.)


One may be confused (as I was) by seeing that, after one epoch of training, the training loss is not equal to the evaluation loss on the same batch of data. This is not specific to models with Dropout or BatchNormalization layers. Consider this example:

from keras import layers, models
import numpy as np

model = models.Sequential()
model.add(layers.Dense(1000, activation='relu', input_dim=100))
model.add(layers.Dense(1))

model.compile(loss='mse', optimizer='adam')
x = np.random.rand(32, 100)
y = np.random.rand(32, 1)

print("Training:")
model.fit(x, y, batch_size=32, epochs=1)

print("\nEvaluation:")
loss = model.evaluate(x, y)
print(loss)

The output:

Training:
Epoch 1/1
32/32 [==============================] - 0s 7ms/step - loss: 0.1520

Evaluation:
32/32 [==============================] - 0s 2ms/step
0.7577340602874756

So why are the losses different if they have been computed over the same data, i.e. 0.1520 != 0.7577?

If you ask this, it's because you, like me, have not paid enough attention: 0.1520 is the loss before updating the parameters of the model (i.e. before the backward pass, or backpropagation), and 0.7577 is the loss after the weights of the model have been updated. Even though the data used is the same, the state of the model when computing those loss values is not. (Another question: why has the loss increased after backpropagation? Simply because you have trained for just one epoch, so the weight updates are not stable enough yet.)
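The same effect can be reproduced without Keras. The sketch below is a minimal illustration under my own assumptions: a one-parameter linear model, MSE loss, and a single plain gradient-descent step instead of Adam. The helper names and learning rates are hypothetical.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of the MSE with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def one_step(learning_rate, w=0.0):
    """Return (loss before the update, loss after the update) for one batch."""
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]                      # generated by y = 2 * x
    before = mse(w, xs, ys)                   # analogous to what fit() logs
    w -= learning_rate * mse_grad(w, xs, ys)  # single weight update
    after = mse(w, xs, ys)                    # analogous to what evaluate() returns
    return before, after

small_before, small_after = one_step(learning_rate=0.05)
large_before, large_after = one_step(learning_rate=0.3)

print("small step:", small_before, "->", small_after)
print("large step:", large_before, "->", large_after)
```

With the small step the loss drops after the update; with the large step it overshoots and the loss goes up. In both cases the two numbers differ because they are computed with different weights.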

To confirm this, you can also use the same data batch as the validation data:

model.fit(x, y, batch_size=32, epochs=1, validation_data=(x,y))

If you run the code above with this modified line, you will get an output like this (the exact values may of course differ for you):

Training:
Train on 32 samples, validate on 32 samples
Epoch 1/1
32/32 [==============================] - 0s 15ms/step - loss: 0.1273 - val_loss: 0.5344

Evaluation:
32/32 [==============================] - 0s 89us/step
0.5344240665435791

You can see that the validation loss and the evaluation loss are exactly the same: this is because validation is performed at the end of the epoch (i.e. when the model weights have already been updated).
