Question
I am training a model with tf.keras in TensorFlow 2.0. My model appears to train successfully, but it does not iterate through the entire dataset. When I restructured the code for TensorFlow 1.15, I did not have this issue in tensorflow 1.x. I am following this tutorial for Multiple Input Series. Below are more details:
I have a time-series dataset. It is small enough to load into memory, so I do not need the Dataset API. I window the time series to produce two arrays, X and Y, for instance:
X=[
[[1,2,3],[4,5,6], [7,8,9]],
[[4,5,6],[7,8,9], [10,11,12]],
[[7,8,9],[10,11,12],[13,14,15]],
...
]
Y = [
[4],
[7],
[10],
...
]
(Yes, I realize that I could just as easily include only one of the features and make X=[[[1,2,3]], [[4,5,6]], [[7,8,9]], ...], but once the pipeline works I am going to include many features which are not this perfectly synced. Also, even when I include only the 1st feature, I still see the problem I describe.)
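The windowing described above can be sketched in plain Python (split_sequences is a hypothetical helper name; the X/Y arrays shown imply that each window's target is the first feature of the row one step after the window's start):

```python
def split_sequences(data, n_steps):
    """Slide a window of n_steps rows over `data` (a list of feature rows).

    Matching the example X/Y arrays above, the target for each window is
    the first feature of the row one step after the window's start.
    """
    X, Y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])      # window of n_steps rows
        Y.append([data[i + 1][0]])         # first feature, one step ahead
    return X, Y

series = [[1, 2, 3], [4, 5, 6], [7, 8, 9],
          [10, 11, 12], [13, 14, 15], [16, 17, 18]]
X, Y = split_sequences(series, n_steps=3)
# X[0] == [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and Y == [[4], [7], [10]]
```

The resulting nested lists can then be converted to numpy arrays of shape (samples, n_steps, n_features) before being passed to the model.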
Then, I build my model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
and then I train it:
model.fit([X], [Y], epochs=300, validation_split=0.2)
It correctly reports the number of train and validation samples, and then the progress bar pops up... but that's where the success stops. The val_loss and val_mean_squared_error are always 0 for every epoch, and it appears to train on no more than a fraction (~1/1000) of the windows in my dataset. This is the printout:
Epoch X/300
192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
When I execute the same code in tf 1.15, it runs as I expect: the epochs take ~45 minutes (in tf 2.0 they take < 3 seconds), and tf 1.15 reports a legitimate val_loss and val_mean_squared_error. I cannot figure out why the model does not train correctly in tf 2.0. This is the first time I have written code directly for tf 2.0 rather than migrating from tf 1.13, but all of the legacy code that I upgraded from tf 1.13 to tf 2.0 executed without any errors. None of the legacy code that I migrated had sequential models.
No errors, warnings, or info messages are reported; it just stops iterating through my dataset early. Does anyone have any insight into changes in tf.keras.Model.fit in TensorFlow 2.0 that could be causing this? Or are there any mistakes in the path I have taken? Any insight would be HUGELY appreciated. Thanks!
EDIT 11/25:
I have filed a GitHub issue for this bug here. Please see that post for updates on progress, and I'll try to remember to update this post when the issue is resolved.
Answer 1:
The behaviour you describe is suspicious and sounds a lot like a bug on TF's side.
One possible thing you can try is enabling TF2's behaviour in TF 1.15 by calling tf.compat.v1.enable_v2_behavior() right after importing tensorflow. This makes a lot of internal changes (honestly, I myself have no clue what exactly it does; the docs only say "It switches all global behaviors that are different between TensorFlow 1.x and 2.x to behave as intended for 2.x."), but it may help you figure out whether the source of the error is somewhere in TensorFlow's implementation or in your code.
Another check I would do is to make sure you are using tf.keras everywhere (i.e., TensorFlow's implementation of the Keras API) instead of "standalone" Keras (the one you would install via pip install keras). The former is heavily tailored for compatibility with TF, and perhaps the latter does not yet fully tolerate the heavy changes between TF1 and TF2, although this is pure speculation.
Answer 2:
Actually, this is a bug that appears when you update Keras and TensorFlow. For a quick fix on Google Colab, first uninstall TensorFlow:
pip uninstall tensorflow
When it says the kernel needs a restart, do it. Then uninstall Keras too:
pip uninstall keras
Now install tensorflow v2.1.0:
pip install tensorflow==2.1.0
and then install keras v2.3.1:
pip install keras==2.3.1
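Equivalently, the two pins above can be captured in a requirements.txt so the fix is reproducible across sessions (same versions as the commands above; remember to restart the runtime after installing):

```
tensorflow==2.1.0
keras==2.3.1
```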
In older versions, when you train on the MNIST dataset (which contains 60,000 training images) with Keras, the left-hand side of the progress bar shows x/60000, and at each step the counter advances by the batch size:
Epoch 3/4
60000/60000 [==============================] - 98s 2ms/step - loss: 0.0084 - accuracy: 0.9973 - val_loss: 0.0066 - val_accuracy: 0.9977
However, in newer versions, the number on the left-hand side of the progress bar is the total number of images divided by the batch size, i.e., the number of steps (batches) per epoch:
Epoch 3/4
200/200 [==============================] - 92s 461ms/step - loss: 0.2783 - accuracy: 0.3571 - val_loss: 0.2649 - val_accuracy: 0.4344
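The relationship between the two progress-bar displays can be checked with a quick calculation (the batch size of 300 is an assumption chosen to reproduce the 200 steps shown above; it is not stated in the original post):

```python
import math

samples = 60000    # MNIST training images
batch_size = 300   # assumed batch size: 60000 / 300 = 200

# Older Keras progress bars counted samples (x/60000);
# newer ones count steps, i.e. batches per epoch (x/200).
steps_per_epoch = math.ceil(samples / batch_size)
print(steps_per_epoch)  # prints 200
```

So a counter like 200/200 in a newer version corresponds to the same full pass over the data that 60000/60000 showed in an older version.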
Note that this problem is not just about the number of samples shown during training: on some architectures the overall performance drops significantly with newer versions. The example above is for a classifier, yet with the same code I get totally different results (shown here for epoch 3); you can see that everything is different. I don't know why this bug exists in newer versions, but I hope the experts can fix it in the future.
Source: https://stackoverflow.com/questions/58826512/tensorflow-2-0-does-not-iterate-through-entire-dataset-when-tf-keras-model-fit-i