I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model f
The issue is most certainly caused by concurrent execution of different session objects. After moving the first model's session from the background thread to the main thread, I repeated the controlled experiment several times (running for over 24 hours and reaching convergence) and never observed NaN. With concurrent execution, on the other hand, the model diverges within a few minutes.
I've restructured my code to use a common session object for all models.
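
For anyone landing here with the same problem, this is a minimal sketch of what a shared-session layout could look like. It assumes TensorFlow's 1.x-style API (used here through `tf.compat.v1`), which the post does not state explicitly. The scope names `model_a` and `model_b`, the `linear` helper, the tiny layer sizes, and the temporary checkpoint are all hypothetical stand-ins, not my actual code. Both models live in one graph and run in one session: the first model's variables are restored from the checkpoint, and only the second model's variables are initialized and trained.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()


def linear(x, in_dim, out_dim, scope):
    """Hypothetical stand-in for a real model: a single linear layer."""
    with tf1.variable_scope(scope):
        w = tf1.get_variable("w", [in_dim, out_dim])
        b = tf1.get_variable("b", [out_dim], initializer=tf1.zeros_initializer())
        return tf.matmul(x, w) + b


# Stand-in for the pre-existing checkpoint of the first model.
ckpt_path = os.path.join(tempfile.mkdtemp(), "model_a.ckpt")
with tf.Graph().as_default():
    x0 = tf1.placeholder(tf.float32, [None, 4])
    linear(x0, 4, 3, "model_a")
    with tf1.Session() as sess:
        sess.run(tf1.global_variables_initializer())
        tf1.train.Saver().save(sess, ckpt_path)

# One graph holding both models, so a single session can run everything.
graph = tf.Graph()
with graph.as_default():
    x = tf1.placeholder(tf.float32, [None, 4])
    y = tf1.placeholder(tf.float32, [None, 1])
    features = linear(x, 4, 3, "model_a")  # first model (restored)
    # The second model consumes the first model's output directly.
    pred = linear(tf.stop_gradient(features), 3, 1, "model_b")
    loss = tf.reduce_mean(tf.square(pred - y))

    a_vars = tf1.get_collection(tf1.GraphKeys.GLOBAL_VARIABLES, scope="model_a")
    b_vars = tf1.get_collection(tf1.GraphKeys.GLOBAL_VARIABLES, scope="model_b")
    # Plain gradient descent creates no extra optimizer variables,
    # so initializing b_vars alone is sufficient here.
    train_op = tf1.train.GradientDescentOptimizer(0.05).minimize(
        loss, var_list=b_vars)
    saver_a = tf1.train.Saver(var_list=a_vars)  # restores model_a only
    init_b = tf1.variables_initializer(b_vars)  # initializes model_b only

rng = np.random.RandomState(0)
batch_x = rng.randn(32, 4).astype(np.float32)
batch_y = rng.randn(32, 1).astype(np.float32)

with tf1.Session(graph=graph) as sess:  # the single shared session
    saver_a.restore(sess, ckpt_path)
    sess.run(init_b)
    a_before = sess.run(a_vars[0])
    for _ in range(20):
        _, final_loss = sess.run([train_op, loss], {x: batch_x, y: batch_y})
    a_after = sess.run(a_vars[0])
```

Because the second model reads `features` inside the same `sess.run` call, there is no background thread and no second session to coordinate. With an optimizer that keeps its own state (Adam, for example), those extra variables would also need initializing.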