Question
While running a Kubeflow pipeline whose code uses TensorFlow 2.0, the error below is displayed at the end of each epoch:
W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Also, after some epochs, the logs stop and this error is shown:
This step is in Failed state with this message: The node was low on resource: memory. Container main was using 100213872Ki, which exceeds its request of 0. Container wait was using 25056Ki, which exceeds its request of 0.
Answer 1:
In my case, the batch_size and steps_per_epoch did not match. For example:
his = Test_model.fit_generator(datagen.flow(trainrancrop_images, trainrancrop_labels, batch_size=batchsize), steps_per_epoch=len(trainrancrop_images)/batchsize, validation_data=(test_images, test_labels), epochs=1, callbacks=[callback])
The batch_size in datagen.flow must correspond to the steps_per_epoch in Test_model.fit_generator (I had actually used the wrong value for steps_per_epoch).
This is one possible cause of the error, I guess.
I think the problem arises when the batch size and the number of steps (iterations) per epoch do not correspond.
The float you get when computing the steps by division may also be a problem.
Check your code for this issue.
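A minimal sketch of one way to keep the two values consistent, assuming the generator yields a smaller final batch when the sample count is not a multiple of the batch size. The helper name steps_per_epoch and the sizes used here are hypothetical and only for illustration; they are not from the original code.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Whole number of batches needed to see every sample once.

    Rounds up, so a smaller final batch still counts as one step.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(num_samples / batch_size)


# Hypothetical sizes, for illustration only.
num_samples = 1000
batch_size = 32

print(num_samples / batch_size)                  # 31.25, a float rather than a step count
print(steps_per_epoch(num_samples, batch_size))  # 32
```

The integer result can then be passed as steps_per_epoch, with the same batch size given to datagen.flow. If the last partial batch should be skipped instead, num_samples // batch_size would be the value to use.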
Good luck :)
Answer 2:
In my case, I installed tf-nightly and now it works, though I am new to TensorFlow. I followed this link.
You can try it.
Answer 3:
I have the same problem. People claim that the warning is superfluous and that it has been removed in tf-nightly; see here. But the memory leak is still there on each epoch.
Answer 4:
This was due to incompatible CUDA and TensorFlow versions. The versions below work well with each other:
tensorflow-gpu==2.0.0
tensorflow-addons==0.6.0
nvidia/cuda:10.0-cudnn7-runtime
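As a rough sketch, the Python package pins above could be checked inside the container with the standard library (Python 3.8+). The helper names installed_version and check_pins are hypothetical, and this does not verify the CUDA base image, only the two pip packages.

```python
from importlib import metadata

# Pins taken from the versions listed above.
PINNED = {"tensorflow-gpu": "2.0.0", "tensorflow-addons": "0.6.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Return {package: (expected, installed)} for every mismatch."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


for package, (expected, found) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result from check_pins means both packages match the pinned versions.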
Source: https://stackoverflow.com/questions/60000573/error-occurred-when-finalizing-generatordataset-iterator-cancelled-operation-w